2018-04-03 20:23:33 +03:00
// SPDX-License-Identifier: GPL-2.0
2007-06-12 17:07:21 +04:00
/*
2008-09-29 23:18:18 +04:00
 * Copyright (C) 2007,2008 Oracle.  All rights reserved.
2007-06-12 17:07:21 +04:00
 */
2007-10-16 00:22:39 +04:00
# include <linux/sched.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 11:04:11 +03:00
# include <linux/slab.h>
2012-05-16 19:18:50 +04:00
# include <linux/rbtree.h>
2017-05-31 20:44:31 +03:00
# include <linux/mm.h>
2021-09-20 15:33:13 +03:00
# include <linux/error-injection.h>
2022-10-19 17:50:49 +03:00
# include "messages.h"
2007-02-02 17:18:22 +03:00
# include "ctree.h"
# include "disk-io.h"
2007-03-23 22:56:19 +03:00
# include "transaction.h"
2007-10-16 00:14:19 +04:00
# include "print-tree.h"
2008-06-26 00:01:30 +04:00
# include "locking.h"
2018-10-30 17:43:24 +03:00
# include "volumes.h"
btrfs: qgroup: Use delayed subtree rescan for balance
Before this patch, qgroup code traces the whole subtree of subvolume and
reloc trees unconditionally.
This makes qgroup numbers consistent, but it could cause tons of
unnecessary extent tracing, which causes a lot of overhead.
However for subtree swap of balance, just swap both subtrees because
they contain the same contents and tree structure, so qgroup numbers
won't change.
It's the race window between subtree swap and transaction commit that could
cause qgroup numbers to change.
This patch will delay the qgroup subtree scan until COW happens for the
subtree root.
So if there are no other operations on the fs, balance won't cause extra
qgroup overhead. (best case scenario)
Depending on the workload, most of the subtree scan can still be
avoided.
Only for worst case scenario, it will fall back to old subtree swap
overhead. (scan all swapped subtrees)
[[Benchmark]]
Hardware:
VM 4G vRAM, 8 vCPUs,
disk is using 'unsafe' cache mode,
backing device is SAMSUNG 850 evo SSD.
Host has 16G ram.
Mkfs parameter:
--nodesize 4K (To bump up tree size)
Initial subvolume contents:
4G data copied from /usr and /lib.
(With enough regular small files)
Snapshots:
16 snapshots of the original subvolume.
each snapshot has 3 random files modified.
balance parameter:
-m
So the content should be pretty similar to a real world root fs layout.
And after file system population, there is no other activity, so it
should be the best case scenario.
| v4.20-rc1 | w/ patchset | diff
-----------------------------------------------------------------------
relocated extents | 22615 | 22457 | -0.1%
qgroup dirty extents | 163457 | 121606 | -25.6%
time (sys) | 22.884s | 18.842s | -17.6%
time (real) | 27.724s | 22.884s | -17.5%
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-23 10:15:17 +03:00
# include "qgroup.h"
2021-03-11 17:31:07 +03:00
# include "tree-mod-log.h"
btrfs: tree-checker: check extent buffer owner against owner rootid
Btrfs doesn't check whether the tree block respects the root owner.
This means, if a tree block is referred to by a parent in the extent tree, but
has an owner of 5, btrfs can still continue reading the tree block, as long as
it doesn't trigger other sanity checks.
Normally this is fine, but combined with the empty tree check in
check_leaf(), if we hit an empty extent tree, but the root node has
csum tree owner, we can let such an extent buffer sneak in.
Shrink the hole by:
- Do extra eb owner check at tree read time
- Make sure the root owner extent buffer exactly matches the root id.
Unfortunately we can't yet completely patch the hole, there are several
call sites that can't pass all the info we need:
- For reloc/log trees
Their owner is key::offset, not key::objectid.
We need the full root key to do that accurate check.
For now, we just skip the ownership check for those trees.
- For add_data_references() of relocation
That call site doesn't have any parent/ownership info, as all the
bytenrs are from btrfs_find_all_leafs().
- For direct backref items walk
Direct backref items record the parent bytenr directly, thus unlike
indirect backref items, we don't do a full tree search.
Thus in that case, we don't have the full parent owner to check.
For the latter two cases, they all pass 0 as @owner_root, thus we can
skip those cases if @owner_root is 0.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-16 03:05:58 +03:00
# include "tree-checker.h"
2022-10-19 17:50:51 +03:00
# include "fs.h"
2022-10-19 17:50:59 +03:00
# include "accessors.h"
2022-10-24 21:46:57 +03:00
# include "extent-tree.h"
2022-10-26 22:08:34 +03:00
# include "relocation.h"
2022-11-15 19:16:12 +03:00
# include "file-item.h"
2007-02-23 16:38:36 +03:00
2022-09-14 18:06:38 +03:00
static struct kmem_cache *btrfs_path_cachep;
2007-03-16 23:20:31 +03:00
static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
		      *root, struct btrfs_path *path, int level);
2017-01-18 10:24:37 +03:00
static int split_leaf(struct btrfs_trans_handle *trans, struct btrfs_root *root,
		      const struct btrfs_key *ins_key, struct btrfs_path *path,
		      int data_size, int extend);
2007-10-16 00:14:19 +04:00
static int push_node_left(struct btrfs_trans_handle *trans,
2016-06-23 01:54:24 +03:00
			  struct extent_buffer *dst,
2008-04-24 18:54:32 +04:00
			  struct extent_buffer *src, int empty);
2007-10-16 00:14:19 +04:00
static int balance_node_right(struct btrfs_trans_handle *trans,
			      struct extent_buffer *dst_buf,
			      struct extent_buffer *src_buf);
2007-02-21 00:40:44 +03:00
2019-08-30 14:36:09 +03:00
static const struct btrfs_csums {
	u16		size;
2020-02-27 23:00:45 +03:00
	const char	name[10];
	const char	driver[12];
2019-08-30 14:36:09 +03:00
} btrfs_csums[] = {
	[BTRFS_CSUM_TYPE_CRC32] = { .size = 4, .name = "crc32c" },
2019-10-07 12:11:01 +03:00
	[BTRFS_CSUM_TYPE_XXHASH] = { .size = 8, .name = "xxhash64" },
2019-10-07 12:11:02 +03:00
	[BTRFS_CSUM_TYPE_SHA256] = { .size = 32, .name = "sha256" },
2019-10-07 12:11:02 +03:00
	[BTRFS_CSUM_TYPE_BLAKE2] = { .size = 32, .name = "blake2b",
				     .driver = "blake2b-256" },
2019-08-30 14:36:09 +03:00
};
2022-11-15 19:16:11 +03:00
/*
 * The leaf data grows from end-to-front in the node.  This returns the offset
 * of the start of the last item, which is the end of the leaf data stack.
 */
static unsigned int leaf_data_end(const struct extent_buffer *leaf)
{
	u32 nr = btrfs_header_nritems(leaf);

	if (nr == 0)
		return BTRFS_LEAF_DATA_SIZE(leaf->fs_info);
	return btrfs_item_offset(leaf, nr - 1);
}
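The end-to-front layout described above can be sketched in userspace as follows. This is a simplified illustration, not the kernel's implementation: the toy_* names and sizes are hypothetical, and the item headers live in a separate array here, whereas a real leaf keeps headers and data in one shared buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy sizes; real values depend on the filesystem's nodesize. */
#define TOY_LEAF_DATA_SIZE 256u
#define TOY_MAX_ITEMS 8u

struct toy_item {
	uint32_t offset;	/* start of this item's data in the data area */
	uint32_t size;		/* length of this item's data */
};

struct toy_leaf {
	uint32_t nritems;
	struct toy_item items[TOY_MAX_ITEMS];
	uint8_t data[TOY_LEAF_DATA_SIZE];
};

/* Same idea as leaf_data_end(): the lowest used offset in the data area. */
static uint32_t toy_leaf_data_end(const struct toy_leaf *leaf)
{
	if (leaf->nritems == 0)
		return TOY_LEAF_DATA_SIZE;
	return leaf->items[leaf->nritems - 1].offset;
}

/* Append an item; its data is placed just below the current data end. */
static int toy_leaf_append(struct toy_leaf *leaf, const void *buf, uint32_t size)
{
	uint32_t end;

	assert(leaf != NULL);
	end = toy_leaf_data_end(leaf);
	if (leaf->nritems == TOY_MAX_ITEMS || size > end)
		return -1;
	leaf->items[leaf->nritems].offset = end - size;
	leaf->items[leaf->nritems].size = size;
	memcpy(leaf->data + end - size, buf, size);
	leaf->nritems++;
	return 0;
}
```

Each append lowers the value returned by toy_leaf_data_end() by the size of the new item, which is why the last item always marks the end of the data stack.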
2022-11-15 19:16:17 +03:00
/*
 * Move data in a @leaf (using memmove, safe for overlapping ranges).
 *
 * @leaf:	leaf that we're doing a memmove on
 * @dst_offset:	item data offset we're moving to
 * @src_offset:	item data offset we're moving from
 * @len:	length of the data we're moving
 *
 * Wrapper around memmove_extent_buffer() that takes into account the header on
 * the leaf.  The btrfs_item offsets start directly after the header, so we
 * have to adjust any offsets to account for the header in the leaf.  This
 * handles that math to simplify the callers.
 */
static inline void memmove_leaf_data(const struct extent_buffer *leaf,
				     unsigned long dst_offset,
				     unsigned long src_offset,
				     unsigned long len)
{
2022-11-15 19:16:18 +03:00
	memmove_extent_buffer(leaf, btrfs_item_nr_offset(leaf, 0) + dst_offset,
			      btrfs_item_nr_offset(leaf, 0) + src_offset, len);
2022-11-15 19:16:17 +03:00
}
/*
 * Copy item data from @src into @dst at the given offsets.
 *
 * @dst:	destination leaf that we're copying into
 * @src:	source leaf that we're copying from
 * @dst_offset:	item data offset we're copying to
 * @src_offset:	item data offset we're copying from
 * @len:	length of the data we're copying
 *
 * Wrapper around copy_extent_buffer() that takes into account the header on
 * the leaf.  The btrfs_item offsets start directly after the header, so we
 * have to adjust any offsets to account for the header in the leaf.  This
 * handles that math to simplify the callers.
 */
static inline void copy_leaf_data(const struct extent_buffer *dst,
				  const struct extent_buffer *src,
				  unsigned long dst_offset,
				  unsigned long src_offset, unsigned long len)
{
2022-11-15 19:16:18 +03:00
	copy_extent_buffer(dst, src, btrfs_item_nr_offset(dst, 0) + dst_offset,
			   btrfs_item_nr_offset(src, 0) + src_offset, len);
2022-11-15 19:16:17 +03:00
}
/*
 * Move items in a @leaf (using memmove).
 *
 * @leaf:	leaf that we're moving the items within
 * @dst_item:	the item nr we're moving into
 * @src_item:	the item nr we're moving from
 * @nr_items:	the number of items to move
 *
 * Wrapper around memmove_extent_buffer() that does the math to get the
 * appropriate offsets into the leaf from the item numbers.
 */
static inline void memmove_leaf_items(const struct extent_buffer *leaf,
				      int dst_item, int src_item, int nr_items)
{
	memmove_extent_buffer(leaf, btrfs_item_nr_offset(leaf, dst_item),
			      btrfs_item_nr_offset(leaf, src_item),
			      nr_items * sizeof(struct btrfs_item));
}
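The item-number-to-offset math that this wrapper hides can be sketched in userspace as follows. This is a rough illustration only: the toy_* types, the 16-byte header and the 8-slot item array are assumptions made for the example, and plain memmove() stands in for memmove_extent_buffer().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the leaf header and item header. */
struct toy_hdr {
	uint8_t pad[16];
};

struct toy_item {
	uint32_t offset;
	uint32_t size;
};

struct toy_leaf {
	struct toy_hdr header;
	struct toy_item items[8];
};

/* Byte offset of item slot @nr, counted from the start of the leaf. */
static size_t toy_item_nr_offset(int nr)
{
	return offsetof(struct toy_leaf, items) +
	       (size_t)nr * sizeof(struct toy_item);
}

/* Shift @nr_items item headers from slot @src_item to slot @dst_item. */
static void toy_memmove_leaf_items(struct toy_leaf *leaf, int dst_item,
				   int src_item, int nr_items)
{
	uint8_t *base = (uint8_t *)leaf;

	assert(dst_item >= 0 && src_item >= 0 && nr_items >= 0);
	memmove(base + toy_item_nr_offset(dst_item),
		base + toy_item_nr_offset(src_item),
		(size_t)nr_items * sizeof(struct toy_item));
}
```

Moving slots 0..2 to 1..3 this way opens slot 0 for a new item header; memmove() is needed because the source and destination ranges overlap.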
/*
 * Copy items from @src into @dst at the given item numbers.
 *
 * @dst:	destination leaf for the items
 * @src:	source leaf for the items
 * @dst_item:	the item nr we're copying into
 * @src_item:	the item nr we're copying from
 * @nr_items:	the number of items to copy
 *
 * Wrapper around copy_extent_buffer() that does the math to get the
 * appropriate offsets into the leaf from the item numbers.
 */
static inline void copy_leaf_items(const struct extent_buffer *dst,
				   const struct extent_buffer *src,
				   int dst_item, int src_item, int nr_items)
{
	copy_extent_buffer(dst, src, btrfs_item_nr_offset(dst, dst_item),
			   btrfs_item_nr_offset(src, src_item),
			   nr_items * sizeof(struct btrfs_item));
}
2023-04-29 23:07:20 +03:00
/* This exists for btrfs-progs usages. */
u16 btrfs_csum_type_size(u16 type)
{
	return btrfs_csums[type].size;
}
2019-08-30 14:36:09 +03:00
int btrfs_super_csum_size(const struct btrfs_super_block *s)
{
	u16 t = btrfs_super_csum_type(s);
	/*
	 * csum type is validated at mount time
	 */
2023-04-29 23:07:20 +03:00
	return btrfs_csum_type_size(t);
2019-08-30 14:36:09 +03:00
}
const char *btrfs_super_csum_name(u16 csum_type)
{
	/* csum type is validated at mount time */
	return btrfs_csums[csum_type].name;
}
2019-10-08 19:41:33 +03:00
/*
 * Return driver name if defined, otherwise the name that's also a valid driver
 * name
 */
const char *btrfs_super_csum_driver(u16 csum_type)
{
	/* csum type is validated at mount time */
2020-02-27 23:00:45 +03:00
	return btrfs_csums[csum_type].driver[0] ?
		btrfs_csums[csum_type].driver :
2019-10-08 19:41:33 +03:00
		btrfs_csums[csum_type].name;
}
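The table-with-fallback pattern used by the checksum helpers can be sketched in userspace as follows. The toy_* names and the two-entry table are hypothetical, and the bounds check is an addition for the example: the helpers above skip it because the csum type is validated at mount time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checksum type indices for the sketch. */
enum toy_csum_type { TOY_CSUM_A, TOY_CSUM_B };

struct toy_csum {
	uint16_t size;
	const char name[10];
	const char driver[12];	/* left empty when the name is the driver */
};

/* The longest driver string must fit the fixed-size field, NUL included. */
static_assert(sizeof("blake2b-256") <= 12, "driver name must fit");

static const struct toy_csum toy_csums[] = {
	[TOY_CSUM_A] = { .size = 4, .name = "crc32c" },
	[TOY_CSUM_B] = { .size = 32, .name = "blake2b",
			 .driver = "blake2b-256" },
};

/* Return the driver name if one is set, otherwise fall back to the name. */
static const char *toy_csum_driver(uint16_t type)
{
	if (type >= sizeof(toy_csums) / sizeof(toy_csums[0]))
		return NULL;	/* added for the sketch, see lead-in */
	return toy_csums[type].driver[0] ?
		toy_csums[type].driver :
		toy_csums[type].name;
}
```

Members omitted from a designated initializer are zero-initialized, so an entry without .driver has driver[0] == 0 and takes the fallback branch.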
2020-07-27 18:38:19 +03:00
size_t __attribute_const__ btrfs_get_num_csums(void)
2019-10-07 12:11:03 +03:00
{
	return ARRAY_SIZE(btrfs_csums);
}
2007-04-04 17:36:31 +04:00
struct btrfs_path *btrfs_alloc_path(void)
2007-04-02 18:50:19 +04:00
{
2022-11-16 17:23:53 +03:00
	might_sleep();
2016-09-12 22:35:52 +03:00
	return kmem_cache_zalloc(btrfs_path_cachep, GFP_NOFS);
2007-04-02 18:50:19 +04:00
}
2008-09-29 23:18:18 +04:00
/* this also releases the path */
2007-04-04 17:36:31 +04:00
void btrfs_free_path(struct btrfs_path *p)
2007-01-26 23:51:26 +03:00
{
2010-12-26 00:22:30 +03:00
	if (!p)
		return;
2011-04-21 03:20:15 +04:00
	btrfs_release_path(p);
2007-04-04 17:36:31 +04:00
	kmem_cache_free(btrfs_path_cachep, p);
2007-01-26 23:51:26 +03:00
}
2008-09-29 23:18:18 +04:00
/*
 * path release drops references on the extent buffers in the path
 * and it drops any locks held by this path
 *
 * It is safe to call this on paths that hold no locks or extent buffers.
 */
2011-04-21 03:20:15 +04:00
noinline void btrfs_release_path(struct btrfs_path *p)
2007-02-02 17:18:22 +03:00
{
	int i;
2008-06-26 00:01:30 +04:00
2007-03-13 17:46:10 +03:00
	for (i = 0; i < BTRFS_MAX_LEVEL; i++) {
2008-06-26 00:01:31 +04:00
		p->slots[i] = 0;
2007-02-02 17:18:22 +03:00
		if (!p->nodes[i])
2008-06-26 00:01:30 +04:00
			continue;
		if (p->locks[i]) {
2011-07-16 23:23:14 +04:00
			btrfs_tree_unlock_rw(p->nodes[i], p->locks[i]);
2008-06-26 00:01:30 +04:00
			p->locks[i] = 0;
		}
2007-10-16 00:14:19 +04:00
		free_extent_buffer(p->nodes[i]);
2008-06-26 00:01:31 +04:00
		p->nodes[i] = NULL;
2007-02-02 17:18:22 +03:00
	}
}
2022-11-03 16:39:01 +03:00
/*
 * We want the transaction abort to print stack trace only for errors where the
 * cause could be a bug, e.g. due to ENOSPC, and not for common errors that are
 * caused by external factors.
 */
bool __cold abort_should_print_stack(int errno)
{
	switch (errno) {
	case -EIO:
	case -EROFS:
	case -ENOMEM:
		return false;
	}
	return true;
}
2008-09-29 23:18:18 +04:00
/*
 * safely gets a reference on the root node of a tree.  A lock
 * is not taken, so a concurrent writer may put a different node
 * at the root of the tree.  See btrfs_lock_root_node for the
 * looping required.
 *
 * The extent buffer returned by this has a reference taken, so
 * it won't disappear.  It may stop being the root of the tree
 * at any time because there are no locks held.
 */
2008-06-26 00:01:30 +04:00
struct extent_buffer *btrfs_root_node(struct btrfs_root *root)
{
	struct extent_buffer *eb;
2011-03-23 21:54:42 +03:00
2012-03-10 01:01:49 +04:00
	while (1) {
		rcu_read_lock();
		eb = rcu_dereference(root->node);
		/*
		 * RCU really hurts here, we could free up the root node because
2016-05-20 04:18:45 +03:00
		 * it was COWed but we may not get the new root node yet so do
2012-03-10 01:01:49 +04:00
		 * the inc_not_zero dance and if it doesn't work then
		 * synchronize_rcu and try again.
		 */
		if (atomic_inc_not_zero(&eb->refs)) {
			rcu_read_unlock();
			break;
		}
		rcu_read_unlock();
		synchronize_rcu();
	}
2008-06-26 00:01:30 +04:00
	return eb;
}
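The "inc_not_zero dance" in the loop above relies on taking a reference only while the count is still non-zero. A minimal userspace sketch of that primitive, using C11 atomics rather than the kernel's atomic_t API, might look as follows. The toy_* names are hypothetical, and the RCU read-side protection and the retry after synchronize_rcu() are not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object with a reference count. */
struct toy_buf {
	atomic_int refs;
};

/* Take a reference only if the count has not already dropped to zero. */
static bool toy_inc_not_zero(atomic_int *refs)
{
	int old = atomic_load(refs);

	while (old != 0) {
		/* On failure, 'old' is updated to the value currently stored. */
		if (atomic_compare_exchange_weak(refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the count must have been positive. */
static void toy_put(atomic_int *refs)
{
	int old = atomic_fetch_sub(refs, 1);

	assert(old > 0);
}
```

A failed attempt means the object is already on its way to being freed, so the caller has to look up the current root pointer again instead of using the stale one.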
2020-05-15 09:01:40 +03:00
/*
 * Cowonly root (not-shareable trees, everything not subvolume or reloc roots),
 * just get put onto a simple dirty list.  Transaction walks this list to make
 * sure they get properly updated on disk.
2008-09-29 23:18:18 +04:00
 */
2008-03-24 22:01:56 +03:00
static void add_root_to_dirty_list(struct btrfs_root *root)
{
2016-06-23 01:54:23 +03:00
	struct btrfs_fs_info *fs_info = root->fs_info;
2014-12-16 19:54:43 +03:00
	if (test_bit(BTRFS_ROOT_DIRTY, &root->state) ||
	    !test_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state))
		return;
2016-06-23 01:54:23 +03:00
	spin_lock(&fs_info->trans_lock);
2014-12-16 19:54:43 +03:00
	if (!test_and_set_bit(BTRFS_ROOT_DIRTY, &root->state)) {
		/* Want the extent tree to be the last on the list */
2018-08-06 08:25:24 +03:00
		if (root->root_key.objectid == BTRFS_EXTENT_TREE_OBJECTID)
2014-12-16 19:54:43 +03:00
			list_move_tail(&root->dirty_list,
2016-06-23 01:54:23 +03:00
				       &fs_info->dirty_cowonly_roots);
2014-12-16 19:54:43 +03:00
		else
			list_move(&root->dirty_list,
2016-06-23 01:54:23 +03:00
				  &fs_info->dirty_cowonly_roots);
2008-03-24 22:01:56 +03:00
	}
2016-06-23 01:54:23 +03:00
	spin_unlock(&fs_info->trans_lock);
2008-03-24 22:01:56 +03:00
}
2008-09-29 23:18:18 +04:00
/*
 * used by snapshot creation to make a copy of a root for a tree with
 * a given objectid.  The buffer with the new root node is returned in
 * cow_ret, and this func returns zero on success or a negative error code.
 */
2007-12-18 04:14:01 +03:00
int btrfs_copy_root(struct btrfs_trans_handle *trans,
		    struct btrfs_root *root,
		    struct extent_buffer *buf,
		    struct extent_buffer **cow_ret, u64 new_root_objectid)
{
2016-06-23 01:54:23 +03:00
	struct btrfs_fs_info *fs_info = root->fs_info;
2007-12-18 04:14:01 +03:00
	struct extent_buffer *cow;
	int ret = 0;
	int level;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 18:45:14 +04:00
	struct btrfs_disk_key disk_key;
2007-12-18 04:14:01 +03:00
2020-05-15 09:01:40 +03:00
	WARN_ON(test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
2016-06-23 01:54:23 +03:00
		trans->transid != fs_info->running_transaction->transid);
2020-05-15 09:01:40 +03:00
	WARN_ON(test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
2014-04-02 15:51:05 +04:00
		trans->transid != root->last_trans);
2007-12-18 04:14:01 +03:00
	level = btrfs_header_level(buf);
2009-06-10 18:45:14 +04:00
	if (level == 0)
		btrfs_item_key(buf, &disk_key, 0);
	else
		btrfs_node_key(buf, &disk_key, 0);
2008-09-23 21:14:14 +04:00
2014-06-15 03:54:12 +04:00
	cow = btrfs_alloc_tree_block(trans, root, 0, new_root_objectid,
2020-08-20 18:46:07 +03:00
				     &disk_key, level, buf->start, 0,
				     BTRFS_NESTING_NEW_ROOT);
2009-06-10 18:45:14 +04:00
	if (IS_ERR(cow))
2007-12-18 04:14:01 +03:00
		return PTR_ERR(cow);
2016-11-08 20:30:31 +03:00
	copy_extent_buffer_full(cow, buf);
2007-12-18 04:14:01 +03:00
	btrfs_set_header_bytenr(cow, cow->start);
	btrfs_set_header_generation(cow, trans->transid);
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 18:45:14 +04:00
btrfs_set_header_backref_rev(cow, BTRFS_MIXED_BACKREF_REV);
btrfs_clear_header_flag(cow, BTRFS_HEADER_FLAG_WRITTEN |
BTRFS_HEADER_FLAG_RELOC);
if (new_root_objectid == BTRFS_TREE_RELOC_OBJECTID)
btrfs_set_header_flag(cow, BTRFS_HEADER_FLAG_RELOC);
else
btrfs_set_header_owner(cow, new_root_objectid);
2007-12-18 04:14:01 +03:00
2018-10-30 17:43:24 +03:00
write_extent_buffer_fsid(cow, fs_info->fs_devices->metadata_uuid);
2008-11-18 05:11:30 +03:00
2007-12-18 04:14:01 +03:00
WARN_ON(btrfs_header_generation(buf) > trans->transid);
2009-06-10 18:45:14 +04:00
if (new_root_objectid == BTRFS_TREE_RELOC_OBJECTID)
2014-07-02 21:54:25 +04:00
ret = btrfs_inc_ref(trans, root, cow, 1);
2009-06-10 18:45:14 +04:00
else
2014-07-02 21:54:25 +04:00
ret = btrfs_inc_ref(trans, root, cow, 0);
2021-01-14 22:02:46 +03:00
if (ret) {
2021-02-04 17:35:44 +03:00
btrfs_tree_unlock(cow);
free_extent_buffer(cow);
2021-01-14 22:02:46 +03:00
btrfs_abort_transaction(trans, ret);
2007-12-18 04:14:01 +03:00
return ret;
2021-01-14 22:02:46 +03:00
}
2007-12-18 04:14:01 +03:00
btrfs_mark_buffer_dirty(cow);
*cow_ret = cow;
return 0;
}
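The reference-count rule described in the "Mixed back reference" commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct toy_block`, `toy_cow()` and `TOY_MAX_CHILDREN` are hypothetical names invented for illustration, not kernel APIs, and the model ignores back ref items, locking, and transactions entirely.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CHILDREN 4	/* hypothetical fan-out, illustration only */

/* Hypothetical stand-in for a tree block; not a kernel structure. */
struct toy_block {
	unsigned int refs;	/* how many parents reference this block */
	int freed;		/* set once the block has been released */
	size_t nr_children;
	struct toy_block *children[TOY_MAX_CHILDREN];
};

/*
 * Model of the rule from the commit message:
 *  - non-shared block (refs == 1): free the old block at once; the copy
 *    inherits the old block's references, so child counts do not change.
 *  - shared block (refs > 1): bump the count of every extent the copy
 *    points to, and drop one reference from the old block.
 */
static void toy_cow(struct toy_block *old, struct toy_block *cow)
{
	size_t i;

	assert(old->refs >= 1 && old->nr_children <= TOY_MAX_CHILDREN);

	cow->refs = 1;
	cow->freed = 0;
	cow->nr_children = old->nr_children;
	for (i = 0; i < old->nr_children; i++)
		cow->children[i] = old->children[i];

	if (old->refs > 1) {
		for (i = 0; i < cow->nr_children; i++)
			cow->children[i]->refs++;
		old->refs--;
	} else {
		old->refs = 0;
		old->freed = 1;
	}
}
```

In this model, copying a block with `refs == 1` leaves every child's count untouched, which is the saving the commit message attributes to avoiding dead root records.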
2009-06-10 18:45:14 +04:00
/*
* check if the tree block can be shared by multiple trees
*/
int btrfs_block_can_be_shared(struct btrfs_root *root,
struct extent_buffer *buf)
{
/*
2020-05-15 09:01:40 +03:00
* Tree blocks not in shareable trees and tree roots are never shared.
* If a block was allocated after the last snapshot and the block was
* not allocated by tree relocation, we know the block is not shared.
2009-06-10 18:45:14 +04:00
*/
2020-05-15 09:01:40 +03:00
if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
2009-06-10 18:45:14 +04:00
buf != root->node && buf != root->commit_root &&
(btrfs_header_generation(buf) <=
btrfs_root_last_snapshot(&root->root_item) ||
btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
return 1;
2018-06-21 09:45:00 +03:00
2009-06-10 18:45:14 +04:00
return 0;
}
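The check performed by btrfs_block_can_be_shared() can be restated as a standalone predicate. The following is a minimal sketch, assuming simplified stand-in types: `struct toy_root`, `struct toy_buf` and `toy_block_can_be_shared()` are hypothetical and only mirror the fields the function above reads (the shareable bit, the last snapshot generation, the root and commit root pointers, the block generation, and the RELOC header flag).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an extent buffer header. */
struct toy_buf {
	uint64_t generation;	/* transaction that allocated the block */
	bool reloc_flag;	/* models BTRFS_HEADER_FLAG_RELOC */
};

/* Hypothetical stand-in for the parts of a root the check reads. */
struct toy_root {
	bool shareable;			/* models BTRFS_ROOT_SHAREABLE */
	uint64_t last_snapshot;		/* generation of the last snapshot */
	const struct toy_buf *node;	/* current root block */
	const struct toy_buf *commit_root;
};

/*
 * A block may be shared only if its tree is shareable, it is not a tree
 * root, and it either predates the last snapshot or was allocated by
 * tree relocation.
 */
static bool toy_block_can_be_shared(const struct toy_root *root,
				    const struct toy_buf *buf)
{
	assert(root && buf);

	if (!root->shareable)
		return false;
	if (buf == root->node || buf == root->commit_root)
		return false;
	return buf->generation <= root->last_snapshot || buf->reloc_flag;
}
```

A block allocated after the last snapshot without the relocation flag is reported as not shareable, matching the comment inside the function.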
static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct extent_buffer *buf,
2010-05-16 18:46:25 +04:00
struct extent_buffer *cow,
int *last_ref)
2009-06-10 18:45:14 +04:00
{
2016-06-23 01:54:23 +03:00
struct btrfs_fs_info *fs_info = root->fs_info;
2009-06-10 18:45:14 +04:00
u64 refs;
u64 owner;
u64 flags;
u64 new_flags = 0;
int ret;
/*
* Backrefs update rules:
*
* Always use full backrefs for extent pointers in a tree block
* allocated by tree relocation.
*
* If a shared tree block is no longer referenced by its owner
* tree (btrfs_header_owner(buf) == root->root_key.objectid),
* use full backrefs for extent pointers in the tree block.
*
* If a tree block is being relocated
* (root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID),
* use full backrefs for extent pointers in the tree block.
* The reason for this is that some operations (such as drop tree)
* are only allowed for blocks that use full backrefs.
*/
if (btrfs_block_can_be_shared(root, buf)) {
2016-06-23 01:54:24 +03:00
ret = btrfs_lookup_extent_info(trans, fs_info, buf->start,
2013-03-07 23:22:04 +04:00
btrfs_header_level(buf), 1,
&refs, &flags);
2011-08-09 00:20:18 +04:00
if (ret)
return ret;
2023-06-08 13:27:45 +03:00
if (unlikely(refs == 0)) {
btrfs_crit(fs_info,
"found 0 references for tree block at bytenr %llu level %d root %llu",
buf->start, btrfs_header_level(buf),
btrfs_root_id(root));
ret = -EUCLEAN;
btrfs_abort_transaction(trans, ret);
2011-08-30 01:17:04 +04:00
return ret;
}
2009-06-10 18:45:14 +04:00
	} else {
		refs = 1;
		if (root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID ||
		    btrfs_header_backref_rev(buf) < BTRFS_MIXED_BACKREF_REV)
			flags = BTRFS_BLOCK_FLAG_FULL_BACKREF;
		else
			flags = 0;
	}
	owner = btrfs_header_owner(buf);
	BUG_ON(owner == BTRFS_TREE_RELOC_OBJECTID &&
	       !(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF));
	if (refs > 1) {
		if ((owner == root->root_key.objectid ||
		     root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) &&
		    !(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF)) {
2014-07-02 21:54:25 +04:00
			ret = btrfs_inc_ref(trans, root, buf, 1);
2017-11-21 21:58:49 +03:00
			if (ret)
				return ret;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 18:45:14 +04:00
			if (root->root_key.objectid ==
			    BTRFS_TREE_RELOC_OBJECTID) {
2014-07-02 21:54:25 +04:00
				ret = btrfs_dec_ref(trans, root, buf, 0);
2017-11-21 21:58:49 +03:00
				if (ret)
					return ret;
2014-07-02 21:54:25 +04:00
				ret = btrfs_inc_ref(trans, root, cow, 1);
2017-11-21 21:58:49 +03:00
				if (ret)
					return ret;
2009-06-10 18:45:14 +04:00
			}
			new_flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF;
		} else {
			if (root->root_key.objectid ==
			    BTRFS_TREE_RELOC_OBJECTID)
2014-07-02 21:54:25 +04:00
				ret = btrfs_inc_ref(trans, root, cow, 1);
2009-06-10 18:45:14 +04:00
			else
2014-07-02 21:54:25 +04:00
				ret = btrfs_inc_ref(trans, root, cow, 0);
2017-11-21 21:58:49 +03:00
			if (ret)
				return ret;
2009-06-10 18:45:14 +04:00
		}
		if (new_flags != 0) {
2023-04-29 23:07:11 +03:00
			ret = btrfs_set_disk_extent_flags(trans, buf, new_flags);
2011-08-09 00:20:18 +04:00
			if (ret)
				return ret;
2009-06-10 18:45:14 +04:00
		}
	} else {
		if (flags & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
			if (root->root_key.objectid ==
			    BTRFS_TREE_RELOC_OBJECTID)
2014-07-02 21:54:25 +04:00
				ret = btrfs_inc_ref(trans, root, cow, 1);
2009-06-10 18:45:14 +04:00
			else
2014-07-02 21:54:25 +04:00
				ret = btrfs_inc_ref(trans, root, cow, 0);
2017-11-21 21:58:49 +03:00
			if (ret)
				return ret;
2014-07-02 21:54:25 +04:00
			ret = btrfs_dec_ref(trans, root, buf, 1);
2017-11-21 21:58:49 +03:00
			if (ret)
				return ret;
2009-06-10 18:45:14 +04:00
		}
2023-01-27 00:00:58 +03:00
		btrfs_clear_buffer_dirty(trans, buf);
2010-05-16 18:46:25 +04:00
		*last_ref = 1;
2009-06-10 18:45:14 +04:00
	}
	return 0;
}
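The branch structure of the function that ends above can be restated as a small standalone decision table. The following is only a sketch under stated assumptions: `struct cow_ref_plan`, `plan_cow_refs()` and its boolean parameters are hypothetical names invented for illustration and are not part of btrfs. It records which of the `btrfs_inc_ref()` / `btrfs_dec_ref()` calls the code above would issue for a given reference count and flag state, and it leaves out the error returns and the `BUG_ON()` check.

```c
#include <stdbool.h>

/* Hypothetical summary of the reference updates picked for one COW. */
struct cow_ref_plan {
	bool inc_buf_full;	/* btrfs_inc_ref(buf, full_backref = 1) */
	bool dec_buf;		/* btrfs_dec_ref(buf, ...) */
	bool inc_cow;		/* btrfs_inc_ref(cow, ...) */
	bool cow_full_backref;	/* full_backref argument used for cow */
	bool set_full_backref;	/* old block gains FULL_BACKREF on disk */
	bool last_ref;		/* old block held the only reference */
};

/*
 * Illustrative model only. The parameters stand in for:
 *   refs           - extent reference count of the old block
 *   full_backref   - flags & BTRFS_BLOCK_FLAG_FULL_BACKREF
 *   owner_is_root  - owner == root->root_key.objectid
 *   root_is_reloc  - root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID
 */
static struct cow_ref_plan plan_cow_refs(unsigned long refs, bool full_backref,
					 bool owner_is_root, bool root_is_reloc)
{
	struct cow_ref_plan p = { false };

	if (refs > 1) {
		if ((owner_is_root || root_is_reloc) && !full_backref) {
			/* Shared block still using fuzzy back refs. */
			p.inc_buf_full = true;
			if (root_is_reloc) {
				p.dec_buf = true;
				p.inc_cow = true;
				p.cow_full_backref = true;
			}
			p.set_full_backref = true;
		} else {
			p.inc_cow = true;
			p.cow_full_backref = root_is_reloc;
		}
	} else {
		if (full_backref) {
			p.inc_cow = true;
			p.cow_full_backref = root_is_reloc;
			p.dec_buf = true;
		}
		/* Sole reference: the old block can be dropped. */
		p.last_ref = true;
	}
	return p;
}
```

For example, `plan_cow_refs(1, false, true, false)` models the common non-shared case, where no reference counts are touched and only `last_ref` is set.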
2008-09-29 23:18:18 +04:00
/*
2009-01-06 05:25:51 +03:00
 * does the dirty work in cow of a single block.  The parent block (if
 * supplied) is updated to point to the new cow copy.  The new buffer is marked
 * dirty and returned locked.  If you modify the block it needs to be marked
 * dirty again.
2008-09-29 23:18:18 +04:00
 *
 * search_start -- an allocation hint for the new block
 *
2009-01-06 05:25:51 +03:00
 * empty_size -- a hint that you plan on doing more cow.  This is the size in
 * bytes the allocator should try to find free next to the block it returns.
 * This is just a hint and may be ignored by the allocator.
2008-09-29 23:18:18 +04:00
 */
2009-01-06 05:25:51 +03:00
static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
2007-10-16 00:14:19 +04:00
			     struct btrfs_root *root,
			     struct extent_buffer *buf,
			     struct extent_buffer *parent, int parent_slot,
			     struct extent_buffer **cow_ret,
2020-08-20 18:46:03 +03:00
			     u64 search_start, u64 empty_size,
			     enum btrfs_lock_nesting nest)
2007-03-03 00:08:05 +03:00
{
2016-06-23 01:54:23 +03:00
	struct btrfs_fs_info *fs_info = root->fs_info;
2009-06-10 18:45:14 +04:00
	struct btrfs_disk_key disk_key;
2007-10-16 00:14:19 +04:00
	struct extent_buffer *cow;
2011-08-09 00:20:18 +04:00
	int level, ret;
2010-05-16 18:46:25 +04:00
	int last_ref = 0;
2008-06-26 00:01:30 +04:00
	int unlock_orig = 0;
2016-09-22 22:11:34 +03:00
	u64 parent_start = 0;
2007-12-11 17:25:06 +03:00
2008-06-26 00:01:30 +04:00
	if (*cow_ret == buf)
		unlock_orig = 1;
2021-09-22 12:36:45 +03:00
	btrfs_assert_tree_write_locked(buf);
2008-06-26 00:01:30 +04:00
2020-05-15 09:01:40 +03:00
	WARN_ON(test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
2016-06-23 01:54:23 +03:00
		trans->transid != fs_info->running_transaction->transid);
2020-05-15 09:01:40 +03:00
	WARN_ON(test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
2014-04-02 15:51:05 +04:00
		trans->transid != root->last_trans);
2007-10-16 00:14:19 +04:00
2007-12-11 17:25:06 +03:00
	level = btrfs_header_level(buf);
2008-09-23 21:14:14 +04:00
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key and level, and the
tree in which the pointer lives. This information allows us to find the
pointer by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 18:45:14 +04:00
	if (level == 0)
		btrfs_item_key(buf, &disk_key, 0);
	else
		btrfs_node_key(buf, &disk_key, 0);
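The reference-count rule from the "Mixed back reference" message above can be sketched in isolation. This is a minimal model only: `struct demo_block` and `demo_update_refs_for_cow()` are hypothetical names invented for illustration, not btrfs structures or APIs, and the sketch ignores back reference types entirely. In the function annotated here, reference handling is delegated to update_ref_for_cow().

```c
#include <stddef.h>

/* Hypothetical in-memory model of a tree block; not a btrfs structure. */
struct demo_block {
	int refs;                     /* reference count of this block */
	struct demo_block *child[4];  /* extents this block points to */
	size_t nr_children;
	int freed;                    /* set once the block has been released */
};

/*
 * Apply the rule from the commit message when `old` is COWed into `cow`.
 * `cow` is assumed to already be a copy of `old` holding one reference.
 */
static void demo_update_refs_for_cow(struct demo_block *old,
				     struct demo_block *cow)
{
	size_t i;

	if (old->refs == 1) {
		/*
		 * Non-shared block: free the old block at once. The copy
		 * inherits the old block's references, so the children's
		 * counts are left untouched.
		 */
		old->refs = 0;
		old->freed = 1;
		return;
	}

	/*
	 * Shared block: the old block stays alive for its other owners,
	 * so every extent the new block points to gains one reference
	 * and the old block loses the one we held.
	 */
	for (i = 0; i < cow->nr_children; i++)
		cow->child[i]->refs++;
	old->refs--;
}
```

The point of the non-shared branch is that no lower-level reference counts are modified at all, which is what removes the need for the dead root walk described in the message.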
2016-09-22 22:11:34 +03:00
	if ((root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) && parent)
		parent_start = parent->start;
2009-06-10 18:45:14 +04:00
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by and large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 16:43:06 +03:00
	cow = btrfs_alloc_tree_block(trans, root, parent_start,
				     root->root_key.objectid, &disk_key, level,
				     search_start, empty_size, nest);
2007-06-22 22:16:25 +04:00
	if (IS_ERR(cow))
		return PTR_ERR(cow);
2007-08-08 00:15:09 +04:00
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able to get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 17:25:08 +03:00
	/* cow is set to blocking by btrfs_init_new_buffer */
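The spinning/blocking protocol from the "explicit blocking points" message above can be modelled as a small state machine. This is a single-threaded sketch under stated assumptions: `struct demo_eb_lock` and all `demo_*` functions are hypothetical, and where the real code spins or sleeps on a wait queue, the model simply reports failure through a trylock.

```c
#include <stdbool.h>

/* Hypothetical model of an extent buffer lock; names are illustrative. */
struct demo_eb_lock {
	bool spin_held;  /* stands in for the spinlock */
	bool blocking;   /* stands in for the EXTENT_BUFFER_BLOCKING bit */
};

/* A buffer counts as locked in either the spinning or the blocking state. */
static bool demo_is_locked(const struct demo_eb_lock *l)
{
	return l->spin_held || l->blocking;
}

/*
 * Take the lock in spinning mode. The described code would wait for the
 * blocking bit to clear; this model just returns false instead.
 */
static bool demo_tree_trylock(struct demo_eb_lock *l)
{
	if (demo_is_locked(l))
		return false;
	l->spin_held = true;
	return true;
}

/* Before an operation that may schedule: set the bit, drop the spinlock. */
static void demo_set_lock_blocking(struct demo_eb_lock *l)
{
	l->blocking = true;
	l->spin_held = false;
}

/* After the operation: retake the spinlock, clear the bit. */
static void demo_clear_lock_blocking(struct demo_eb_lock *l)
{
	l->spin_held = true;
	l->blocking = false;
}

/* Unlock works from either state, based on the blocking bit. */
static void demo_tree_unlock(struct demo_eb_lock *l)
{
	if (l->blocking)
		l->blocking = false;
	else
		l->spin_held = false;
}
```

A holder that never needs to schedule goes trylock then unlock without touching the blocking bit, which is the cheap path the message aims to keep common.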
2016-11-08 20:30:31 +03:00
	copy_extent_buffer_full(cow, buf);
2007-10-16 00:15:53 +04:00
	btrfs_set_header_bytenr(cow, cow->start);
2007-10-16 00:14:19 +04:00
	btrfs_set_header_generation(cow, trans->transid);
2009-06-10 18:45:14 +04:00
	btrfs_set_header_backref_rev(cow, BTRFS_MIXED_BACKREF_REV);
	btrfs_clear_header_flag(cow, BTRFS_HEADER_FLAG_WRITTEN |
				     BTRFS_HEADER_FLAG_RELOC);
	if (root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID)
		btrfs_set_header_flag(cow, BTRFS_HEADER_FLAG_RELOC);
	else
		btrfs_set_header_owner(cow, root->root_key.objectid);
2007-08-08 00:15:09 +04:00
2018-10-30 17:43:24 +03:00
	write_extent_buffer_fsid(cow, fs_info->fs_devices->metadata_uuid);
2008-11-18 05:11:30 +03:00
2011-08-09 00:20:18 +04:00
	ret = update_ref_for_cow(trans, root, buf, cow, &last_ref);
2011-08-30 01:30:39 +04:00
	if (ret) {
btrfs: cleanup cow block on error
In fstest btrfs/064 a transaction abort in __btrfs_cow_block could lead
to a system lockup. It gets stuck trying to write back inodes, and the
write back thread was trying to lock an extent buffer:
$ cat /proc/2143497/stack
[<0>] __btrfs_tree_lock+0x108/0x250
[<0>] lock_extent_buffer_for_io+0x35e/0x3a0
[<0>] btree_write_cache_pages+0x15a/0x3b0
[<0>] do_writepages+0x28/0xb0
[<0>] __writeback_single_inode+0x54/0x5c0
[<0>] writeback_sb_inodes+0x1e8/0x510
[<0>] wb_writeback+0xcc/0x440
[<0>] wb_workfn+0xd7/0x650
[<0>] process_one_work+0x236/0x560
[<0>] worker_thread+0x55/0x3c0
[<0>] kthread+0x13a/0x150
[<0>] ret_from_fork+0x1f/0x30
This is because we got an error while COWing a block, specifically here
	if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state)) {
		ret = btrfs_reloc_cow_block(trans, root, buf, cow);
		if (ret) {
			btrfs_abort_transaction(trans, ret);
			return ret;
		}
	}
[16402.241552] BTRFS: Transaction aborted (error -2)
[16402.242362] WARNING: CPU: 1 PID: 2563188 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x376/0x540
[16402.249469] CPU: 1 PID: 2563188 Comm: fsstress Not tainted 5.9.0-rc6+ #8
[16402.249936] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
[16402.250525] RIP: 0010:__btrfs_cow_block+0x376/0x540
[16402.252417] RSP: 0018:ffff9cca40e578b0 EFLAGS: 00010282
[16402.252787] RAX: 0000000000000025 RBX: 0000000000000002 RCX: ffff9132bbd19388
[16402.253278] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9132bbd19380
[16402.254063] RBP: ffff9132b41a49c0 R08: 0000000000000000 R09: 0000000000000000
[16402.254887] R10: 0000000000000000 R11: ffff91324758b080 R12: ffff91326ef17ce0
[16402.255694] R13: ffff91325fc0f000 R14: ffff91326ef176b0 R15: ffff9132815e2000
[16402.256321] FS: 00007f542c6d7b80(0000) GS:ffff9132bbd00000(0000) knlGS:0000000000000000
[16402.256973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16402.257374] CR2: 00007f127b83f250 CR3: 0000000133480002 CR4: 0000000000370ee0
[16402.257867] Call Trace:
[16402.258072] btrfs_cow_block+0x109/0x230
[16402.258356] btrfs_search_slot+0x530/0x9d0
[16402.258655] btrfs_lookup_file_extent+0x37/0x40
[16402.259155] __btrfs_drop_extents+0x13c/0xd60
[16402.259628] ? btrfs_block_rsv_migrate+0x4f/0xb0
[16402.259949] btrfs_replace_file_extents+0x190/0x820
[16402.260873] btrfs_clone+0x9ae/0xc00
[16402.261139] btrfs_extent_same_range+0x66/0x90
[16402.261771] btrfs_remap_file_range+0x353/0x3b1
[16402.262333] vfs_dedupe_file_range_one.part.0+0xd5/0x140
[16402.262821] vfs_dedupe_file_range+0x189/0x220
[16402.263150] do_vfs_ioctl+0x552/0x700
[16402.263662] __x64_sys_ioctl+0x62/0xb0
[16402.264023] do_syscall_64+0x33/0x40
[16402.264364] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[16402.264862] RIP: 0033:0x7f542c7d15cb
[16402.266901] RSP: 002b:00007ffd35944ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[16402.267627] RAX: ffffffffffffffda RBX: 00000000009d1968 RCX: 00007f542c7d15cb
[16402.268298] RDX: 00000000009d2490 RSI: 00000000c0189436 RDI: 0000000000000003
[16402.268958] RBP: 00000000009d2520 R08: 0000000000000036 R09: 00000000009d2e64
[16402.269726] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
[16402.270659] R13: 000000000001f000 R14: 00000000009d1970 R15: 00000000009d2e80
[16402.271498] irq event stamp: 0
[16402.271846] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[16402.272497] hardirqs last disabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
[16402.273343] softirqs last enabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
[16402.273905] softirqs last disabled at (0): [<0000000000000000>] 0x0
[16402.274338] ---[ end trace 737874a5a41a8236 ]---
[16402.274669] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
[16402.276179] BTRFS info (device dm-9): forced readonly
[16402.277046] BTRFS: error (device dm-9) in btrfs_replace_file_extents:2723: errno=-2 No such entry
[16402.278744] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
[16402.279968] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
[16402.280582] BTRFS info (device dm-9): balance: ended with status: -30
The problem here is that as soon as we allocate the new block it is
locked and marked dirty in the btree inode. This means that we could
attempt to writeback this block and need to lock the extent buffer.
However we're not unlocking it here and thus we deadlock.
Fix this by unlocking the cow block if we have any errors inside of
__btrfs_cow_block, and also free it so we do not leak it.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-29 15:53:54 +03:00
		btrfs_tree_unlock(cow);
		free_extent_buffer(cow);
2016-06-11 01:19:25 +03:00
		btrfs_abort_transaction(trans, ret);
2011-08-30 01:30:39 +04:00
		return ret;
	}
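The fix described in "btrfs: cleanup cow block on error" boils down to releasing everything acquired for the new block before returning an error. A minimal sketch of that pattern follows. `struct demo_buf`, `demo_cow_block()` and the step callbacks are hypothetical stand-ins rather than kernel APIs, and leaving the block locked for the caller on success is an assumption of this sketch.

```c
/* Hypothetical stand-in for a newly allocated tree block. */
struct demo_buf {
	int locked;  /* 1 while the block is write-locked */
	int refs;    /* references held on the block */
};

/* Hypothetical step that may fail; returns 0 or a negative error code. */
typedef int (*demo_step_fn)(struct demo_buf *cow);

static int demo_cow_block(struct demo_buf *cow, demo_step_fn step)
{
	int ret;

	/* The new block starts out locked and referenced by us. */
	cow->locked = 1;
	cow->refs = 1;

	ret = step(cow);
	if (ret) {
		/*
		 * Error path: unlock first so that writeback of the dirty
		 * block cannot wait on the lock forever, then drop our
		 * reference so the block is not leaked.
		 */
		cow->locked = 0;
		cow->refs--;
		return ret;
	}

	/* Success: in this sketch the block stays locked for the caller. */
	return 0;
}

/* Two trivial steps used to exercise both paths. */
static int demo_step_ok(struct demo_buf *cow)
{
	(void)cow;
	return 0;
}

static int demo_step_fail(struct demo_buf *cow)
{
	(void)cow;
	return -2;  /* mirrors the "error -2" seen in the log above */
}
```

Calling `demo_cow_block(&b, demo_step_fail)` leaves `b` unlocked with no references, while `demo_step_ok` leaves it locked with one reference.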
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a separate inode to hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handling for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 18:09:34 +04:00
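The sharing argument in "Btrfs: update space balancing code" (a block COWed through one reloc tree is reused when another reloc tree needs to COW the same block) can be illustrated with a lookup table keyed by the old block address. This is a toy model, not the kernel mechanism: `struct demo_reloc_map`, `demo_reloc_cow()` and the counter-based allocator are assumptions made purely for illustration.

```c
#include <stddef.h>

#define DEMO_MAX_RELOCATED 16

/* Hypothetical table of blocks already COWed through some reloc tree. */
struct demo_reloc_map {
	unsigned long long old_bytenr[DEMO_MAX_RELOCATED];
	unsigned long long new_bytenr[DEMO_MAX_RELOCATED];
	size_t nr;
	unsigned long long next_free;  /* toy allocator: next address to hand out */
};

/*
 * Return the relocated address for `old`. The first request allocates a
 * new block; later requests for the same block, for example through the
 * reloc tree of another subvolume, reuse it so the sharing is kept.
 * Returns 0 if the table is full.
 */
static unsigned long long demo_reloc_cow(struct demo_reloc_map *map,
					 unsigned long long old)
{
	unsigned long long new_addr;
	size_t i;

	for (i = 0; i < map->nr; i++) {
		if (map->old_bytenr[i] == old)
			return map->new_bytenr[i];
	}

	if (map->nr == DEMO_MAX_RELOCATED)
		return 0;

	new_addr = map->next_free++;
	map->old_bytenr[map->nr] = old;
	map->new_bytenr[map->nr] = new_addr;
	map->nr++;
	return new_addr;
}
```

Relocating the same old address twice yields the same new address, so two subvolumes that shared the block before balancing still share it afterwards.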
2020-05-15 09:01:40 +03:00
	if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state)) {
2013-08-30 23:09:51 +04:00
		ret = btrfs_reloc_cow_block(trans, root, buf, cow);
2015-08-06 16:56:58 +03:00
		if (ret) {
2020-09-29 15:53:54 +03:00
			btrfs_tree_unlock(cow);
			free_extent_buffer(cow);
2016-06-11 01:19:25 +03:00
			btrfs_abort_transaction(trans, ret);
2013-08-30 23:09:51 +04:00
			return ret;
2015-08-06 16:56:58 +03:00
		}
2013-08-30 23:09:51 +04:00
	}
2010-05-16 18:49:59 +04:00
2007-03-03 00:08:05 +03:00
	if (buf == root->node) {
2008-06-26 00:01:30 +04:00
		WARN_ON(parent && parent != buf);
2009-06-10 18:45:14 +04:00
		if (root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID ||
		    btrfs_header_backref_rev(buf) < BTRFS_MIXED_BACKREF_REV)
			parent_start = buf->start;
2008-06-26 00:01:30 +04:00
2021-03-11 17:31:08 +03:00
		ret = btrfs_tree_mod_log_insert_root(root->node, cow, true);
2023-06-08 13:27:40 +03:00
		if (ret < 0) {
			btrfs_tree_unlock(cow);
			free_extent_buffer(cow);
			btrfs_abort_transaction(trans, ret);
			return ret;
		}
		atomic_inc(&cow->refs);
2011-03-23 21:54:42 +03:00
		rcu_assign_pointer(root->node, cow);
2008-06-26 00:01:30 +04:00
2021-12-13 11:45:12 +03:00
		btrfs_free_tree_block(trans, btrfs_root_id(root), buf,
				      parent_start, last_ref);
2007-10-16 00:14:19 +04:00
		free_extent_buffer(buf);
2008-03-24 22:01:56 +03:00
		add_root_to_dirty_list(root);
2007-03-03 00:08:05 +03:00
	} else {
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 18:45:14 +04:00
WARN_ON(trans->transid != btrfs_header_generation(parent));
2023-06-08 13:27:37 +03:00
ret = btrfs_tree_mod_log_insert_key(parent, parent_slot,
BTRFS_MOD_LOG_KEY_REPLACE);
if (ret) {
btrfs_tree_unlock(cow);
free_extent_buffer(cow);
btrfs_abort_transaction(trans, ret);
return ret;
}
2007-10-16 00:14:19 +04:00
btrfs_set_node_blockptr(parent, parent_slot,
2007-10-16 00:15:53 +04:00
cow->start);
2007-12-11 17:25:06 +03:00
btrfs_set_node_ptr_generation(parent, parent_slot,
trans->transid);
2007-03-30 22:27:56 +04:00
btrfs_mark_buffer_dirty(parent);
Btrfs: fix tree mod logging
While running the test btrfs/004 from xfstests in a loop, it failed
about 1 time out of 20 runs on my desktop. The failure happened in
the backref walking part of the test, and the test's error message was
like this:
btrfs/004 93s ... [failed, exit status 1] - output mismatch (see /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad)
--- tests/btrfs/004.out 2013-11-26 18:25:29.263333714 +0000
+++ /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad 2013-12-10 15:25:10.327518516 +0000
@@ -1,3 +1,8 @@
QA output created by 004
*** test backref walking
-*** done
+unexpected output from
+ /home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
+expected inum: 405, expected address: 454656, file: /home/fdmanana/btrfs-tests/scratch_1/snap1/p0/d6/d3d/d156/fce, got:
+
...
(Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad' to see the entire diff)
Ran: btrfs/004
Failures: btrfs/004
Failed 1 of 1 tests
But immediately after the test finished, the btrfs inspect-internal command
returned the expected output:
$ btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
inode 405 offset 454656 root 258
inode 405 offset 454656 root 5
It turned out this was because the btrfs_search_old_slot() calls performed
during backref walking (backref.c:__resolve_indirect_ref) were not finding
anything. The reason for this turned out to be that the tree mod logging
code was not logging some node multi-step operations atomically, therefore
btrfs_search_old_slot() callers iterated often over an incomplete tree that
wasn't fully consistent with any tree state from the past. Besides missing
items, this often (but not always) resulted in -EIO errors during old slot
searches, reported in dmesg like this:
[ 4299.933936] ------------[ cut here ]------------
[ 4299.933949] WARNING: CPU: 0 PID: 23190 at fs/btrfs/ctree.c:1343 btrfs_search_old_slot+0x57b/0xab0 [btrfs]()
[ 4299.933950] Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep rfcomm bluetooth parport_pc ppdev binfmt_misc joydev snd_hda_codec_h
[ 4299.933977] CPU: 0 PID: 23190 Comm: btrfs Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70
[ 4299.933978] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 4299.933979] 000000000000053f ffff8806f3fd98f8 ffffffff8176d284 0000000000000007
[ 4299.933982] 0000000000000000 ffff8806f3fd9938 ffffffff8104a81c ffff880659c64b70
[ 4299.933984] ffff880659c643d0 ffff8806599233d8 ffff880701e2e938 0000160000000000
[ 4299.933987] Call Trace:
[ 4299.933991] [<ffffffff8176d284>] dump_stack+0x55/0x76
[ 4299.933994] [<ffffffff8104a81c>] warn_slowpath_common+0x8c/0xc0
[ 4299.933997] [<ffffffff8104a86a>] warn_slowpath_null+0x1a/0x20
[ 4299.934003] [<ffffffffa065d3bb>] btrfs_search_old_slot+0x57b/0xab0 [btrfs]
[ 4299.934005] [<ffffffff81775f3b>] ? _raw_read_unlock+0x2b/0x50
[ 4299.934010] [<ffffffffa0655001>] ? __tree_mod_log_search+0x81/0xc0 [btrfs]
[ 4299.934019] [<ffffffffa06dd9b0>] __resolve_indirect_refs+0x130/0x5f0 [btrfs]
[ 4299.934027] [<ffffffffa06a21f1>] ? free_extent_buffer+0x61/0xc0 [btrfs]
[ 4299.934034] [<ffffffffa06de39c>] find_parent_nodes+0x1fc/0xe40 [btrfs]
[ 4299.934042] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934048] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934056] [<ffffffffa06df980>] iterate_extent_inodes+0xe0/0x250 [btrfs]
[ 4299.934058] [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50
[ 4299.934065] [<ffffffffa06dfb82>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
[ 4299.934071] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934078] [<ffffffffa06b7015>] btrfs_ioctl+0xf65/0x1f60 [btrfs]
[ 4299.934080] [<ffffffff811658b8>] ? handle_mm_fault+0x278/0xb00
[ 4299.934083] [<ffffffff81075563>] ? up_read+0x23/0x40
[ 4299.934085] [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0
[ 4299.934088] [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570
[ 4299.934090] [<ffffffff81776e23>] ? error_sti+0x5/0x6
[ 4299.934093] [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 4299.934096] [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13
[ 4299.934098] [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0
[ 4299.934100] [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934104] ---[ end trace 48f0cfc902491414 ]---
[ 4299.934378] btrfs bad fsid on block 0
These tree mod log operations that must be performed atomically, tree_mod_log_free_eb,
tree_mod_log_eb_copy, tree_mod_log_insert_root and tree_mod_log_insert_move, used to
be performed atomically before the following commit:
c8cc6341653721b54760480b0d0d9b5f09b46741
(Btrfs: stop using GFP_ATOMIC for the tree mod log allocations)
That change removed the atomicity of such operations. This patch restores the
atomicity while still not doing the GFP_ATOMIC allocations of tree_mod_elem
structures, so it has to do the allocations using GFP_NOFS before acquiring
the mod log lock.
This issue has been experienced by several users recently, for example:
http://www.spinics.net/lists/linux-btrfs/msg28574.html
After running the btrfs/004 test for 679 consecutive iterations with this
patch applied, I didn't run into the issue anymore.
Cc: stable@vger.kernel.org
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2013-12-20 19:17:46 +04:00
if (last_ref) {
2021-03-11 17:31:07 +03:00
ret = btrfs_tree_mod_log_free_eb(buf);
2013-12-20 19:17:46 +04:00
if (ret) {
btrfs: cleanup cow block on error
In fstest btrfs/064 a transaction abort in __btrfs_cow_block could lead
to a system lockup. It gets stuck trying to write back inodes, and the
write back thread was trying to lock an extent buffer:
$ cat /proc/2143497/stack
[<0>] __btrfs_tree_lock+0x108/0x250
[<0>] lock_extent_buffer_for_io+0x35e/0x3a0
[<0>] btree_write_cache_pages+0x15a/0x3b0
[<0>] do_writepages+0x28/0xb0
[<0>] __writeback_single_inode+0x54/0x5c0
[<0>] writeback_sb_inodes+0x1e8/0x510
[<0>] wb_writeback+0xcc/0x440
[<0>] wb_workfn+0xd7/0x650
[<0>] process_one_work+0x236/0x560
[<0>] worker_thread+0x55/0x3c0
[<0>] kthread+0x13a/0x150
[<0>] ret_from_fork+0x1f/0x30
This is because we got an error while COWing a block, specifically here
if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state)) {
ret = btrfs_reloc_cow_block(trans, root, buf, cow);
if (ret) {
btrfs_abort_transaction(trans, ret);
return ret;
}
}
[16402.241552] BTRFS: Transaction aborted (error -2)
[16402.242362] WARNING: CPU: 1 PID: 2563188 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x376/0x540
[16402.249469] CPU: 1 PID: 2563188 Comm: fsstress Not tainted 5.9.0-rc6+ #8
[16402.249936] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
[16402.250525] RIP: 0010:__btrfs_cow_block+0x376/0x540
[16402.252417] RSP: 0018:ffff9cca40e578b0 EFLAGS: 00010282
[16402.252787] RAX: 0000000000000025 RBX: 0000000000000002 RCX: ffff9132bbd19388
[16402.253278] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9132bbd19380
[16402.254063] RBP: ffff9132b41a49c0 R08: 0000000000000000 R09: 0000000000000000
[16402.254887] R10: 0000000000000000 R11: ffff91324758b080 R12: ffff91326ef17ce0
[16402.255694] R13: ffff91325fc0f000 R14: ffff91326ef176b0 R15: ffff9132815e2000
[16402.256321] FS: 00007f542c6d7b80(0000) GS:ffff9132bbd00000(0000) knlGS:0000000000000000
[16402.256973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16402.257374] CR2: 00007f127b83f250 CR3: 0000000133480002 CR4: 0000000000370ee0
[16402.257867] Call Trace:
[16402.258072] btrfs_cow_block+0x109/0x230
[16402.258356] btrfs_search_slot+0x530/0x9d0
[16402.258655] btrfs_lookup_file_extent+0x37/0x40
[16402.259155] __btrfs_drop_extents+0x13c/0xd60
[16402.259628] ? btrfs_block_rsv_migrate+0x4f/0xb0
[16402.259949] btrfs_replace_file_extents+0x190/0x820
[16402.260873] btrfs_clone+0x9ae/0xc00
[16402.261139] btrfs_extent_same_range+0x66/0x90
[16402.261771] btrfs_remap_file_range+0x353/0x3b1
[16402.262333] vfs_dedupe_file_range_one.part.0+0xd5/0x140
[16402.262821] vfs_dedupe_file_range+0x189/0x220
[16402.263150] do_vfs_ioctl+0x552/0x700
[16402.263662] __x64_sys_ioctl+0x62/0xb0
[16402.264023] do_syscall_64+0x33/0x40
[16402.264364] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[16402.264862] RIP: 0033:0x7f542c7d15cb
[16402.266901] RSP: 002b:00007ffd35944ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[16402.267627] RAX: ffffffffffffffda RBX: 00000000009d1968 RCX: 00007f542c7d15cb
[16402.268298] RDX: 00000000009d2490 RSI: 00000000c0189436 RDI: 0000000000000003
[16402.268958] RBP: 00000000009d2520 R08: 0000000000000036 R09: 00000000009d2e64
[16402.269726] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
[16402.270659] R13: 000000000001f000 R14: 00000000009d1970 R15: 00000000009d2e80
[16402.271498] irq event stamp: 0
[16402.271846] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[16402.272497] hardirqs last disabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
[16402.273343] softirqs last enabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
[16402.273905] softirqs last disabled at (0): [<0000000000000000>] 0x0
[16402.274338] ---[ end trace 737874a5a41a8236 ]---
[16402.274669] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
[16402.276179] BTRFS info (device dm-9): forced readonly
[16402.277046] BTRFS: error (device dm-9) in btrfs_replace_file_extents:2723: errno=-2 No such entry
[16402.278744] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
[16402.279968] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
[16402.280582] BTRFS info (device dm-9): balance: ended with status: -30
The problem here is that as soon as we allocate the new block it is
locked and marked dirty in the btree inode. This means that we could
attempt to writeback this block and need to lock the extent buffer.
However we're not unlocking it here and thus we deadlock.
Fix this by unlocking the cow block if we have any errors inside of
__btrfs_cow_block, and also free it so we do not leak it.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-29 15:53:54 +03:00
btrfs_tree_unlock(cow);
free_extent_buffer(cow);
2016-06-11 01:19:25 +03:00
btrfs_abort_transaction(trans, ret);
2013-12-20 19:17:46 +04:00
return ret;
}
}
2021-12-13 11:45:12 +03:00
btrfs_free_tree_block(trans, btrfs_root_id(root), buf,
parent_start, last_ref);
2007-03-03 00:08:05 +03:00
}
2008-06-26 00:01:30 +04:00
if (unlock_orig)
btrfs_tree_unlock(buf);
2012-03-10 01:01:49 +04:00
free_extent_buffer_stale(buf);
2007-06-28 23:57:36 +04:00
btrfs_mark_buffer_dirty(cow);
2007-04-02 18:50:19 +04:00
*cow_ret = cow;
2007-03-03 00:08:05 +03:00
return 0;
}
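Every error path in the function above has the same shape: once the new block exists it is locked, so a failing step has to unlock it and drop its reference before aborting the transaction, otherwise writeback can block on the lock. What follows is a minimal user-space sketch of that shape only. All names in it (`struct buffer`, `struct transaction`, `cow_with_cleanup` and the helpers) are hypothetical stand-ins for illustration, not the btrfs API, and `step_err` simply simulates the result of an intermediate step.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an extent buffer: a lock flag plus a refcount. */
struct buffer {
	bool locked;
	int refs;
};

/* Hypothetical stand-in for a transaction handle. */
struct transaction {
	int abort_err;	/* 0 while healthy, the first error once aborted */
};

/* A freshly allocated COW target starts out locked with one reference. */
static struct buffer *buffer_alloc_locked(void)
{
	struct buffer *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	b->locked = true;
	b->refs = 1;
	return b;
}

static void buffer_unlock(struct buffer *b)
{
	b->locked = false;
}

/* Drop one reference and free the buffer when the last one goes away. */
static void buffer_put(struct buffer *b)
{
	assert(b->refs > 0);
	if (--b->refs == 0)
		free(b);
}

/* Record only the first error, so later failures do not overwrite it. */
static void transaction_abort(struct transaction *t, int err)
{
	if (!t->abort_err)
		t->abort_err = err;
}

/*
 * Allocate a locked buffer, then run one simulated step. If the step
 * fails, unlock the buffer and drop its reference before aborting, so
 * nothing is left locked or leaked. On success the still-locked buffer
 * is handed to the caller through cow_ret.
 */
static int cow_with_cleanup(struct transaction *t, int step_err,
			    struct buffer **cow_ret)
{
	struct buffer *cow = buffer_alloc_locked();

	if (!cow)
		return -ENOMEM;

	if (step_err) {
		buffer_unlock(cow);
		buffer_put(cow);
		transaction_abort(t, step_err);
		return step_err;
	}

	*cow_ret = cow;
	return 0;
}
```

Calling `cow_with_cleanup(&t, -ENOENT, &out)` on a zero-initialized `t` returns `-ENOENT`, records that error in `t.abort_err`, and leaves `out` untouched; calling it with `step_err` of 0 hands back a locked buffer that the caller must later unlock and release.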
2009-06-10 18:45:14 +04:00
static inline int should_cow_block(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct extent_buffer *buf)
{
2016-06-21 16:52:41 +03:00
if (btrfs_is_testing(root->fs_info))
2014-05-08 01:06:09 +04:00
return 0;
2014-09-30 01:53:21 +04:00
2018-03-16 04:39:40 +03:00
/* Ensure we can see the FORCE_COW bit */
smp_mb__before_atomic();
2011-11-15 05:48:06 +04:00
/*
* We do not need to cow a block if
* 1) this block is not created or changed in this transaction;
* 2) this block does not belong to TREE_RELOC tree;
* 3) the root is not forced COW.
*
* What is forced COW:
2016-05-20 04:18:45 +03:00
* when we create snapshot during committing the transaction,
2018-11-28 14:05:13 +03:00
* after we've finished copying src root, we must COW the shared
2011-11-15 05:48:06 +04:00
* block to ensure the metadata consistency.
*/
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 18:45:14 +04:00
if ( btrfs_header_generation ( buf ) = = trans - > transid & &
! btrfs_header_flag ( buf , BTRFS_HEADER_FLAG_WRITTEN ) & &
! ( root - > root_key . objectid ! = BTRFS_TREE_RELOC_OBJECTID & &
2011-11-15 05:48:06 +04:00
btrfs_header_flag ( buf , BTRFS_HEADER_FLAG_RELOC ) ) & &
2014-04-02 15:51:05 +04:00
! test_bit ( BTRFS_ROOT_FORCE_COW , & root - > state ) )
2009-06-10 18:45:14 +04:00
return 0 ;
return 1 ;
}
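The reference-count rule from the mixed back reference commit message (free a non-shared block at once, otherwise bump the children and drop one reference on the old block) can be sketched with a toy model. This is only an illustration under stated assumptions: `struct toy_block`, `toy_cow` and `TOY_MAX_PTRS` are hypothetical names, and none of this is the kernel's extent-tree code.

```c
#include <stdint.h>

/* Hypothetical toy model, not a btrfs structure. */
#define TOY_MAX_PTRS 4

struct toy_block {
	int refs;                      /* references held on this block */
	int nr_ptrs;                   /* child pointers in use */
	int *child_refs[TOY_MAX_PTRS]; /* refcounts of the extents pointed to */
	int freed;                     /* set once the block has been released */
};

/* Copy-on-write 'old' into 'dst' following the rule in the commit message. */
static void toy_cow(struct toy_block *old, struct toy_block *dst)
{
	int i;

	dst->nr_ptrs = old->nr_ptrs;
	dst->freed = 0;
	for (i = 0; i < old->nr_ptrs; i++)
		dst->child_refs[i] = old->child_refs[i];

	if (old->refs == 1) {
		/*
		 * Non-shared block: the copy inherits the old block's
		 * references and the old block is freed at once. The
		 * child reference counts are not touched.
		 */
		dst->refs = 1;
		old->refs = 0;
		old->freed = 1;
	} else {
		/*
		 * Shared block: the copy adds one reference to every
		 * extent it points to, and the old block loses one.
		 */
		for (i = 0; i < dst->nr_ptrs; i++)
			(*dst->child_refs[i])++;
		dst->refs = 1;
		old->refs--;
	}
}
```

In the non-shared case no lower-level count changes, which is the saving the commit message describes.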
2008-09-29 23:18:18 +04:00
/*
* cows a single block , see __btrfs_cow_block for the real work .
2016-05-20 04:18:45 +03:00
* This version of it has extra checks so that a block isn ' t COWed more than
2008-09-29 23:18:18 +04:00
* once per transaction , as long as it hasn ' t been written yet
*/
2009-01-06 05:25:51 +03:00
noinline int btrfs_cow_block ( struct btrfs_trans_handle * trans ,
2007-10-16 00:14:19 +04:00
struct btrfs_root * root , struct extent_buffer * buf ,
struct extent_buffer * parent , int parent_slot ,
2020-08-20 18:46:03 +03:00
struct extent_buffer * * cow_ret ,
enum btrfs_lock_nesting nest )
2007-08-08 00:15:09 +04:00
{
2016-06-23 01:54:23 +03:00
struct btrfs_fs_info * fs_info = root - > fs_info ;
2007-08-08 00:15:09 +04:00
u64 search_start ;
2007-10-16 00:14:48 +04:00
int ret ;
2008-01-08 23:46:30 +03:00
2023-09-27 14:09:22 +03:00
if ( unlikely ( test_bit ( BTRFS_ROOT_DELETING , & root - > state ) ) ) {
btrfs_abort_transaction ( trans , - EUCLEAN ) ;
btrfs_crit ( fs_info ,
" attempt to COW block %llu on root %llu that is being deleted " ,
buf - > start , btrfs_root_id ( root ) ) ;
return - EUCLEAN ;
}
2018-11-30 19:52:13 +03:00
btrfs: error out when COWing block using a stale transaction
At btrfs_cow_block() we have these checks to verify we are not using a
stale transaction (a past transaction with an unblocked state or higher),
and the only thing we do is to trigger a WARN with a message and a stack
trace. This however is a critical problem, highly unexpected and if it
happens it's most likely due to a bug, so we should error out and turn the
fs into error state so that such an issue is much more easily noticed if it's
triggered.
The problem is critical because using such a stale transaction will lead to
not persisting the extent buffer used for the COW operation, as allocating
a tree block adds the range of the respective extent buffer to the
->dirty_pages iotree of the transaction, and a stale transaction, in the
unblocked state or higher, will not flush dirty extent buffers anymore,
therefore resulting in not persisting the tree block and resource leaks
(not cleaning the dirty_pages iotree for example).
So do the following changes:
1) Return -EUCLEAN if we find a stale transaction;
2) Turn the fs into error state, with error -EUCLEAN, so that no
transaction can be committed, and generate a stack trace;
3) Combine both conditions into a single if statement, as both are related
and have the same error message;
4) Mark the check as unlikely, since this is not expected to ever happen.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-27 14:09:21 +03:00
/*
* COWing must happen through a running transaction , which always
* matches the current fs generation ( it ' s a transaction with a state
* less than TRANS_STATE_UNBLOCKED ) . If it doesn ' t , then turn the fs
* into error state to prevent the commit of any transaction .
*/
if ( unlikely ( trans - > transaction ! = fs_info - > running_transaction | |
trans - > transid ! = fs_info - > generation ) ) {
btrfs_abort_transaction ( trans , - EUCLEAN ) ;
btrfs_crit ( fs_info ,
" unexpected transaction when attempting to COW block %llu on root %llu, transaction %llu running transaction %llu fs generation %llu " ,
buf - > start , btrfs_root_id ( root ) , trans - > transid ,
fs_info - > running_transaction - > transid ,
fs_info - > generation ) ;
return - EUCLEAN ;
}
2008-01-08 23:46:30 +03:00
2009-06-10 18:45:14 +04:00
if ( ! should_cow_block ( trans , root , buf ) ) {
2007-08-08 00:15:09 +04:00
* cow_ret = buf ;
return 0 ;
}
2009-02-04 17:24:25 +03:00
2015-12-14 19:42:10 +03:00
search_start = buf - > start & ~ ( ( u64 ) SZ_1G - 1 ) ;
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop. Most of
the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able to get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 17:25:08 +03:00
btrfs: qgroup: Use delayed subtree rescan for balance
Before this patch, qgroup code traces the whole subtree of subvolume and
reloc trees unconditionally.
This makes qgroup numbers consistent, but it could cause tons of
unnecessary extent tracing, which causes a lot of overhead.
However, for the subtree swap of balance, we just swap both subtrees, and
because they contain the same contents and tree structure, qgroup numbers
won't change.
It's the race window between subtree swap and transaction commit that could
cause qgroup numbers to change.
This patch will delay the qgroup subtree scan until COW happens for the
subtree root.
So if there are no other operations on the fs, balance won't cause extra
qgroup overhead. (best case scenario)
Depending on the workload, most of the subtree scan can still be
avoided.
Only in the worst case scenario will it fall back to the old subtree swap
overhead. (scan all swapped subtrees)
[[Benchmark]]
Hardware:
    VM 4G vRAM, 8 vCPUs,
    disk is using 'unsafe' cache mode,
    backing device is SAMSUNG 850 evo SSD.
    Host has 16G ram.
Mkfs parameter:
    --nodesize 4K (To bump up tree size)
Initial subvolume contents:
    4G data copied from /usr and /lib.
    (With enough regular small files)
Snapshots:
    16 snapshots of the original subvolume.
    each snapshot has 3 random files modified.
balance parameter:
    -m
So the content should be pretty similar to a real world root fs layout.
And after file system population, there is no other activity, so it
should be the best case scenario.
                      | v4.20-rc1 | w/ patchset | diff
-----------------------------------------------------------------------
relocated extents     | 22615     | 22457       | -0.1%
qgroup dirty extents  | 163457    | 121606      | -25.6%
time (sys)            | 22.884s   | 18.842s     | -17.6%
time (real)           | 27.724s   | 22.884s     | -17.5%
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-23 10:15:17 +03:00
/*
* Before CoWing this block for later modification , check if it ' s
* the subtree root and do the delayed subtree trace if needed .
*
* Also we don ' t care about the error , as it ' s handled internally .
*/
btrfs_qgroup_trace_subtree_after_cow ( trans , root , buf ) ;
2007-10-16 00:14:48 +04:00
ret = __btrfs_cow_block ( trans , root , buf , parent ,
2020-08-20 18:46:03 +03:00
parent_slot , cow_ret , search_start , 0 , nest ) ;
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and can be very
helpful for debugging, e.g.:
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordered_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are critical resources and produce a lot of corner cases during writeback,
so it is valuable to know how a page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when an inode is created, and when an inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In a transaction-based filesystem, it is useful to know the generation and
who does the commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references; these tracepoints are helpful for
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
A chunk is a link between physical offset and logical offset, and stands for space
information in btrfs, so these are helpful for tracing space usage.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-24 14:18:59 +03:00
trace_btrfs_cow_block ( root , buf , * cow_ret ) ;
2007-10-16 00:14:48 +04:00
return ret ;
2007-08-08 00:15:09 +04:00
}
2020-12-16 19:18:43 +03:00
ALLOW_ERROR_INJECTION ( btrfs_cow_block , ERRNO ) ;
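The allocation hint computed in btrfs_cow_block() above masks the block's start offset down to a 1 GiB boundary. A minimal userspace sketch of that arithmetic follows, assuming `TOY_SZ_1G` as a stand-in for the kernel's `SZ_1G` and `toy_round_down_1g` as a hypothetical helper name.

```c
#include <stdint.h>

/* Stand-in for the kernel's SZ_1G constant (1 GiB). */
#define TOY_SZ_1G ((uint64_t)1 << 30)

/*
 * Clear the low 30 bits of 'start', which rounds it down to the nearest
 * multiple of 1 GiB. The mask trick works because 1 GiB is a power of two.
 */
static uint64_t toy_round_down_1g(uint64_t start)
{
	return start & ~(TOY_SZ_1G - 1);
}
```

Any offset inside the same 1 GiB window therefore yields the same search start.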
2007-08-08 00:15:09 +04:00
2008-09-29 23:18:18 +04:00
/*
* helper function for defrag to decide if two blocks pointed to by a
* node are actually close by
*/
2007-10-16 00:17:34 +04:00
static int close_blocks ( u64 blocknr , u64 other , u32 blocksize )
2007-08-08 00:15:09 +04:00
{
2007-10-16 00:17:34 +04:00
if ( blocknr < other & & other - ( blocknr + blocksize ) < 32768 )
2007-08-08 00:15:09 +04:00
return 1 ;
2007-10-16 00:17:34 +04:00
if ( blocknr > other & & blocknr - ( other + blocksize ) < 32768 )
2007-08-08 00:15:09 +04:00
return 1 ;
return 0 ;
}
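The proximity test in close_blocks() can be tried outside the kernel. The sketch below is a standalone copy with `u64`/`u32` replaced by `<stdint.h>` types; `toy_close_blocks` is a hypothetical name. Two blocks count as close when the gap between the end of the lower one and the start of the higher one is under 32768 bytes.

```c
#include <stdint.h>

/* Standalone copy of the proximity check, using userspace integer types. */
static int toy_close_blocks(uint64_t blocknr, uint64_t other, uint32_t blocksize)
{
	/* 'blocknr' lies below 'other': measure the gap after blocknr's end. */
	if (blocknr < other && other - (blocknr + blocksize) < 32768)
		return 1;
	/* 'blocknr' lies above 'other': measure the gap after other's end. */
	if (blocknr > other && blocknr - (other + blocksize) < 32768)
		return 1;
	return 0;
}
```

With a 16384-byte block at offset 0, a neighbour starting at 16384 is adjacent and close, while one starting at 49152 leaves a gap of exactly 32768 bytes and is not.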
2020-06-08 17:06:07 +03:00
# ifdef __LITTLE_ENDIAN
/*
* Compare two keys , on little - endian the disk order is same as CPU order and
* we can avoid the conversion .
*/
static int comp_keys ( const struct btrfs_disk_key * disk_key ,
const struct btrfs_key * k2 )
{
const struct btrfs_key * k1 = ( const struct btrfs_key * ) disk_key ;
return btrfs_comp_cpu_keys ( k1 , k2 ) ;
}
# else
2007-11-06 18:26:24 +03:00
/*
* compare two keys in a memcmp fashion
*/
2017-01-18 10:24:37 +03:00
static int comp_keys ( const struct btrfs_disk_key * disk ,
const struct btrfs_key * k2 )
2007-11-06 18:26:24 +03:00
{
struct btrfs_key k1 ;
btrfs_disk_key_to_cpu ( & k1 , disk ) ;
2009-07-24 19:06:52 +04:00
return btrfs_comp_cpu_keys ( & k1 , k2 ) ;
2007-11-06 18:26:24 +03:00
}
2020-06-08 17:06:07 +03:00
# endif
2007-11-06 18:26:24 +03:00
2008-11-12 22:19:50 +03:00
/*
* same as comp_keys only with two btrfs_key ' s
*/
2019-10-01 20:57:39 +03:00
int __pure btrfs_comp_cpu_keys ( const struct btrfs_key * k1 , const struct btrfs_key * k2 )
2008-11-12 22:19:50 +03:00
{
if ( k1 - > objectid > k2 - > objectid )
return 1 ;
if ( k1 - > objectid < k2 - > objectid )
return - 1 ;
if ( k1 - > type > k2 - > type )
return 1 ;
if ( k1 - > type < k2 - > type )
return - 1 ;
if ( k1 - > offset > k2 - > offset )
return 1 ;
if ( k1 - > offset < k2 - > offset )
return - 1 ;
return 0 ;
}
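The comparison above orders keys by objectid first, then type, then offset. A minimal userspace sketch of the same ordering follows; `struct toy_key` and `toy_comp_keys` are hypothetical stand-ins, and the field widths are an assumption made for the example.

```c
#include <stdint.h>

/* Hypothetical stand-in for a CPU-order key. */
struct toy_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Returns 1, -1 or 0, comparing objectid, then type, then offset. */
static int toy_comp_keys(const struct toy_key *k1, const struct toy_key *k2)
{
	if (k1->objectid > k2->objectid)
		return 1;
	if (k1->objectid < k2->objectid)
		return -1;
	if (k1->type > k2->type)
		return 1;
	if (k1->type < k2->type)
		return -1;
	if (k1->offset > k2->offset)
		return 1;
	if (k1->offset < k2->offset)
		return -1;
	return 0;
}
```

A key with a larger objectid sorts after one with a smaller objectid regardless of type and offset, so all items belonging to one object sit together in the tree.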
2007-11-06 18:26:24 +03:00
2008-09-29 23:18:18 +04:00
/*
* this is used by the defrag code to go through all the
* leaves pointed to by a node and reallocate them so that
* disk order is close to key order
*/
2007-08-08 00:15:09 +04:00
int btrfs_realloc_node ( struct btrfs_trans_handle * trans ,
2007-10-16 00:14:19 +04:00
struct btrfs_root * root , struct extent_buffer * parent ,
2013-01-31 22:21:12 +04:00
int start_slot , u64 * last_ret ,
2007-10-16 00:22:39 +04:00
struct btrfs_key * progress )
2007-08-08 00:15:09 +04:00
{
2016-06-23 01:54:23 +03:00
struct btrfs_fs_info * fs_info = root - > fs_info ;
2007-10-16 00:17:34 +04:00
struct extent_buffer * cur ;
2007-08-08 00:15:09 +04:00
u64 blocknr ;
2007-08-10 22:06:19 +04:00
u64 search_start = * last_ret ;
u64 last_block = 0 ;
2007-08-08 00:15:09 +04:00
u64 other ;
u32 parent_nritems ;
int end_slot ;
int i ;
int err = 0 ;
2007-10-16 00:17:34 +04:00
u32 blocksize ;
2007-11-06 18:26:24 +03:00
int progress_passed = 0 ;
struct btrfs_disk_key disk_key ;
2007-08-08 00:15:09 +04:00
btrfs: error out when reallocating block for defrag using a stale transaction
At btrfs_realloc_node() we have these checks to verify we are not using a
stale transaction (a past transaction with an unblocked state or higher),
and the only thing we do is to trigger two WARN_ON(). This however is a
critical problem, highly unexpected and if it happens it's most likely due
to a bug, so we should error out and turn the fs into error state so that
such an issue is much more easily noticed if it's triggered.
The problem is critical because in btrfs_realloc_node() we COW tree blocks,
and using such a stale transaction will lead to not persisting the extent
buffers used for the COW operations, as allocating a tree block adds the
range of the respective extent buffers to the ->dirty_pages iotree of the
transaction, and a stale transaction, in the unblocked state or higher,
will not flush dirty extent buffers anymore, therefore resulting in not
persisting the tree block and resource leaks (not cleaning the dirty_pages
iotree for example).
So do the following changes:
1) Return -EUCLEAN if we find a stale transaction;
2) Turn the fs into error state, with error -EUCLEAN, so that no
transaction can be committed, and generate a stack trace;
3) Combine both conditions into a single if statement, as both are related
and have the same error message;
4) Mark the check as unlikely, since this is not expected to ever happen.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-27 14:09:23 +03:00
/*
* COWing must happen through a running transaction , which always
* matches the current fs generation ( it ' s a transaction with a state
* less than TRANS_STATE_UNBLOCKED ) . If it doesn ' t , then turn the fs
* into error state to prevent the commit of any transaction .
*/
if ( unlikely ( trans - > transaction ! = fs_info - > running_transaction | |
trans - > transid ! = fs_info - > generation ) ) {
btrfs_abort_transaction ( trans , - EUCLEAN ) ;
btrfs_crit ( fs_info ,
" unexpected transaction when attempting to reallocate parent %llu for root %llu, transaction %llu running transaction %llu fs generation %llu " ,
parent - > start , btrfs_root_id ( root ) , trans - > transid ,
fs_info - > running_transaction - > transid ,
fs_info - > generation ) ;
return - EUCLEAN ;
}
2007-09-11 03:58:16 +04:00
2007-10-16 00:17:34 +04:00
parent_nritems = btrfs_header_nritems ( parent ) ;
2016-06-23 01:54:23 +03:00
blocksize = fs_info - > nodesize ;
2015-02-23 22:48:52 +03:00
end_slot = parent_nritems - 1 ;
2007-08-08 00:15:09 +04:00
2015-02-23 22:48:52 +03:00
if ( parent_nritems < = 1 )
2007-08-08 00:15:09 +04:00
return 0 ;
2015-02-23 22:48:52 +03:00
for ( i = start_slot ; i < = end_slot ; i + + ) {
2007-08-08 00:15:09 +04:00
int close = 1 ;
2007-10-16 00:22:39 +04:00
2007-11-06 18:26:24 +03:00
btrfs_node_key ( parent , & disk_key , i ) ;
if ( ! progress_passed & & comp_keys ( & disk_key , progress ) < 0 )
continue ;
progress_passed = 1 ;
2007-10-16 00:17:34 +04:00
blocknr = btrfs_node_blockptr ( parent , i ) ;
2007-08-10 22:06:19 +04:00
if ( last_block = = 0 )
last_block = blocknr ;
2007-10-25 23:43:18 +04:00
2007-08-08 00:15:09 +04:00
if ( i > 0 ) {
2007-10-16 00:17:34 +04:00
other = btrfs_node_blockptr ( parent , i - 1 ) ;
close = close_blocks ( blocknr , other , blocksize ) ;
2007-08-08 00:15:09 +04:00
}
2015-02-23 22:48:52 +03:00
if ( ! close & & i < end_slot ) {
2007-10-16 00:17:34 +04:00
other = btrfs_node_blockptr ( parent , i + 1 ) ;
close = close_blocks ( blocknr , other , blocksize ) ;
2007-08-08 00:15:09 +04:00
}
2007-08-10 22:06:19 +04:00
if ( close ) {
last_block = blocknr ;
2007-08-08 00:15:09 +04:00
continue ;
2007-08-10 22:06:19 +04:00
}
2007-08-08 00:15:09 +04:00
2020-11-05 18:45:10 +03:00
cur = btrfs_read_node_slot ( parent , i ) ;
if ( IS_ERR ( cur ) )
return PTR_ERR ( cur ) ;
2007-08-10 22:06:19 +04:00
if ( search_start = = 0 )
2007-10-16 00:17:34 +04:00
search_start = last_block ;
2007-08-10 22:06:19 +04:00
2008-06-26 00:01:31 +04:00
btrfs_tree_lock ( cur ) ;
2007-10-16 00:17:34 +04:00
err = __btrfs_cow_block ( trans , root , cur , parent , i ,
2008-06-26 00:01:31 +04:00
& cur , search_start ,
2007-10-16 00:17:34 +04:00
min ( 16 * blocksize ,
2020-08-20 18:46:03 +03:00
( end_slot - i ) * blocksize ) ,
BTRFS_NESTING_COW ) ;
2007-08-29 17:11:44 +04:00
if ( err ) {
2008-06-26 00:01:31 +04:00
btrfs_tree_unlock ( cur ) ;
2007-10-16 00:17:34 +04:00
free_extent_buffer ( cur ) ;
2007-08-08 00:15:09 +04:00
break ;
2007-08-29 17:11:44 +04:00
}
2008-06-26 00:01:31 +04:00
search_start = cur - > start ;
last_block = cur - > start ;
2007-08-10 22:42:37 +04:00
* last_ret = search_start ;
2008-06-26 00:01:31 +04:00
btrfs_tree_unlock ( cur ) ;
free_extent_buffer ( cur ) ;
2007-08-08 00:15:09 +04:00
}
return err ;
}
2007-02-02 19:05:29 +03:00
/*
2021-12-02 13:30:35 +03:00
* Search for a key in the given extent_buffer .
2007-10-16 00:14:19 +04:00
*
2023-02-08 20:46:49 +03:00
* The lower boundary for the search is specified by the slot number @ first_slot .
2023-02-24 06:31:26 +03:00
* Use a value of 0 to search over the whole extent buffer . Works for both
* leaves and nodes .
2007-02-02 19:05:29 +03:00
*
2021-12-02 13:30:35 +03:00
* The slot in the extent buffer is returned via @ slot . If the key exists in the
* extent buffer , then @ slot will point to the slot where the key is , otherwise
* it points to the slot where you would insert the key .
*
* Slot may point to the total number of items ( i . e . one position beyond the last
* key ) if the key is bigger than the last key in the extent buffer .
2007-02-02 19:05:29 +03:00
*/
2023-02-24 06:31:26 +03:00
int btrfs_bin_search ( struct extent_buffer * eb , int first_slot ,
const struct btrfs_key * key , int * slot )
2007-01-26 23:51:26 +03:00
{
2021-12-02 13:30:35 +03:00
unsigned long p ;
int item_size ;
2023-02-08 20:46:49 +03:00
/*
* Use unsigned types for the low and high slots , so that we get a more
* efficient division in the search loop below .
*/
u32 low = first_slot ;
u32 high = btrfs_header_nritems ( eb ) ;
2007-01-26 23:51:26 +03:00
int ret ;
2020-04-30 00:23:37 +03:00
const int key_size = sizeof ( struct btrfs_disk_key ) ;
2007-01-26 23:51:26 +03:00
2023-02-08 20:46:49 +03:00
if ( unlikely ( low > high ) ) {
2016-06-24 02:32:45 +03:00
btrfs_err ( eb - > fs_info ,
2023-02-08 20:46:49 +03:00
" %s: low (%u) > high (%u) eb %llu owner %llu level %d " ,
2016-06-24 02:32:45 +03:00
__func__ , low , high , eb - > start ,
btrfs_header_owner ( eb ) , btrfs_header_level ( eb ) ) ;
return - EINVAL ;
}
2021-12-02 13:30:35 +03:00
if ( btrfs_header_level ( eb ) = = 0 ) {
p = offsetof ( struct btrfs_leaf , items ) ;
item_size = sizeof ( struct btrfs_item ) ;
} else {
p = offsetof ( struct btrfs_node , ptrs ) ;
item_size = sizeof ( struct btrfs_key_ptr ) ;
}
2009-01-06 05:25:51 +03:00
while ( low < high ) {
2020-04-30 00:23:37 +03:00
unsigned long oip ;
unsigned long offset ;
struct btrfs_disk_key * tmp ;
struct btrfs_disk_key unaligned ;
int mid ;
2007-01-26 23:51:26 +03:00
mid = ( low + high ) / 2 ;
2007-10-16 00:14:19 +04:00
offset = p + mid * item_size ;
2020-04-30 00:23:37 +03:00
oip = offset_in_page ( offset ) ;
2007-10-16 00:14:19 +04:00
2020-04-30 00:23:37 +03:00
if ( oip + key_size < = PAGE_SIZE ) {
2020-12-02 09:48:04 +03:00
const unsigned long idx = get_eb_page_index ( offset ) ;
2020-04-30 00:23:37 +03:00
char * kaddr = page_address ( eb - > pages [ idx ] ) ;
2007-10-16 00:14:19 +04:00
2020-12-02 09:48:04 +03:00
oip = get_eb_offset_in_page ( eb , offset ) ;
2020-04-30 00:23:37 +03:00
tmp = ( struct btrfs_disk_key * ) ( kaddr + oip ) ;
2007-10-16 00:14:19 +04:00
} else {
2020-04-30 00:23:37 +03:00
read_extent_buffer ( eb , & unaligned , offset , key_size ) ;
tmp = & unaligned ;
2007-10-16 00:14:19 +04:00
}
2020-04-30 00:23:37 +03:00
2007-01-26 23:51:26 +03:00
ret = comp_keys ( tmp , key ) ;
if ( ret < 0 )
low = mid + 1 ;
else if ( ret > 0 )
high = mid ;
else {
* slot = mid ;
return 0 ;
}
}
* slot = low ;
return 1 ;
}
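The return convention of btrfs_bin_search() (0 with the matching slot, 1 with the insertion slot, a negative errno on a bad lower bound) can be illustrated on a plain sorted array. This is a simplified sketch: `toy_bin_search` is a hypothetical name, and it compares `uint64_t` values instead of reading disk keys out of extent buffer pages.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Search keys[first_slot .. nritems) for 'key'.
 * Returns 0 and sets *slot to the match, or returns 1 and sets *slot to
 * the position where 'key' would be inserted (possibly nritems).
 */
static int toy_bin_search(const uint64_t *keys, uint32_t nritems,
			  uint32_t first_slot, uint64_t key, int *slot)
{
	uint32_t low = first_slot;
	uint32_t high = nritems;

	if (low > high)
		return -EINVAL;

	while (low < high) {
		uint32_t mid = (low + high) / 2;

		if (keys[mid] < key)
			low = mid + 1;
		else if (keys[mid] > key)
			high = mid;
		else {
			*slot = (int)mid;
			return 0;
		}
	}
	*slot = (int)low;
	return 1;
}
```

Searching `{10, 20, 30}` for 25 returns 1 with slot 2, and searching for 40 returns 1 with slot 3, one position past the last key.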
2023-09-08 02:09:33 +03:00
static void root_add_used_bytes ( struct btrfs_root * root )
2010-05-16 18:46:25 +04:00
{
spin_lock ( & root - > accounting_lock ) ;
btrfs_set_root_used ( & root - > root_item ,
2023-09-08 02:09:33 +03:00
btrfs_root_used ( & root - > root_item ) + root - > fs_info - > nodesize ) ;
2010-05-16 18:46:25 +04:00
spin_unlock ( & root - > accounting_lock ) ;
}
2023-09-08 02:09:33 +03:00
static void root_sub_used_bytes ( struct btrfs_root * root )
2010-05-16 18:46:25 +04:00
{
spin_lock ( & root - > accounting_lock ) ;
btrfs_set_root_used ( & root - > root_item ,
2023-09-08 02:09:33 +03:00
btrfs_root_used ( & root - > root_item ) - root - > fs_info - > nodesize ) ;
2010-05-16 18:46:25 +04:00
spin_unlock ( & root - > accounting_lock ) ;
}
2008-09-29 23:18:18 +04:00
/* given a node and slot number, this reads the blocks it points to. The
* extent buffer is returned with a reference taken ( but unlocked ) .
*/
2019-08-21 20:16:27 +03:00
struct extent_buffer * btrfs_read_node_slot ( struct extent_buffer * parent ,
int slot )
2007-03-01 20:04:21 +03:00
{
2008-05-12 20:59:19 +04:00
int level = btrfs_header_level ( parent ) ;
2022-09-14 08:32:50 +03:00
struct btrfs_tree_parent_check check = { 0 } ;
2013-04-23 22:17:42 +04:00
struct extent_buffer * eb ;
2016-07-05 22:10:14 +03:00
if ( slot < 0 | | slot > = btrfs_header_nritems ( parent ) )
return ERR_PTR ( - ENOENT ) ;
2008-05-12 20:59:19 +04:00
2023-02-07 19:57:20 +03:00
ASSERT ( level ) ;
2008-05-12 20:59:19 +04:00
2022-09-14 08:32:50 +03:00
check . level = level - 1 ;
check . transid = btrfs_node_ptr_generation ( parent , slot ) ;
check . owner_root = btrfs_header_owner ( parent ) ;
check . has_first_key = true ;
btrfs_node_key_to_cpu ( parent , & check . first_key , slot ) ;
2019-03-20 16:54:01 +03:00
eb = read_tree_block ( parent - > fs_info , btrfs_node_blockptr ( parent , slot ) ,
2022-09-14 08:32:50 +03:00
& check ) ;
2022-02-22 10:41:19 +03:00
if ( IS_ERR ( eb ) )
return eb ;
if ( ! extent_buffer_uptodate ( eb ) ) {
2016-07-05 22:10:14 +03:00
free_extent_buffer ( eb ) ;
2022-02-22 10:41:19 +03:00
return ERR_PTR ( - EIO ) ;
2013-04-23 22:17:42 +04:00
}
return eb ;
2007-03-01 20:04:21 +03:00
}
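The error-pointer convention used by btrfs_read_node_slot() (return either a valid buffer or an errno encoded in the pointer value) can be illustrated in user space. The following is a minimal sketch, not the kernel implementation: every `sketch_` name is hypothetical, and the reserved range of 4095 error values is an assumption modelled on the kernel's MAX_ERRNO.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins for ERR_PTR / IS_ERR / PTR_ERR. */
#define SKETCH_MAX_ERRNO 4095

static inline void *sketch_err_ptr(long err)
{
	/* Callers pass a negative errno, e.g. -ENOENT. */
	assert(err < 0 && err >= -SKETCH_MAX_ERRNO);
	return (void *)(intptr_t)err;
}

static inline int sketch_is_err(const void *ptr)
{
	/* The highest SKETCH_MAX_ERRNO addresses are treated as error codes. */
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

static inline long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/*
 * Model of the slot bounds check: valid slots are 0 .. nritems - 1,
 * anything else yields an encoded -ENOENT instead of a buffer.
 */
static void *sketch_read_slot(int *children, int nritems, int slot)
{
	if (slot < 0 || slot >= nritems)
		return sketch_err_ptr(-ENOENT);
	return &children[slot];
}
```

A caller tests the result with sketch_is_err() before dereferencing it, mirroring the IS_ERR() / PTR_ERR() pattern in the callers of btrfs_read_node_slot().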
2008-09-29 23:18:18 +04:00
/*
 * Node level balancing, used to make sure nodes are in proper order for
 * item deletion. We balance from the top down, so we have to make sure
 * that a deletion won't leave a node completely empty later on.
*/
2008-09-06 00:13:11 +04:00
static noinline int balance_level ( struct btrfs_trans_handle * trans ,
2008-01-03 18:01:48 +03:00
struct btrfs_root * root ,
struct btrfs_path * path , int level )
2007-03-01 20:04:21 +03:00
{
2016-06-23 01:54:23 +03:00
struct btrfs_fs_info * fs_info = root - > fs_info ;
2007-10-16 00:14:19 +04:00
struct extent_buffer * right = NULL ;
struct extent_buffer * mid ;
struct extent_buffer * left = NULL ;
struct extent_buffer * parent = NULL ;
2007-03-01 20:04:21 +03:00
int ret = 0 ;
int wret ;
int pslot ;
int orig_slot = path - > slots [ level ] ;
2007-03-01 23:16:26 +03:00
u64 orig_ptr ;
2007-03-01 20:04:21 +03:00
2018-09-12 01:06:23 +03:00
ASSERT ( level > 0 ) ;
2007-03-01 20:04:21 +03:00
2007-10-16 00:14:19 +04:00
mid = path - > nodes [ level ] ;
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able to get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 17:25:08 +03:00
2020-08-20 18:46:10 +03:00
WARN_ON ( path - > locks [ level ] ! = BTRFS_WRITE_LOCK ) ;
2007-12-11 17:25:06 +03:00
WARN_ON ( btrfs_header_generation ( mid ) ! = trans - > transid ) ;
2007-03-13 16:28:32 +03:00
orig_ptr = btrfs_node_blockptr ( mid , orig_slot ) ;
2007-03-01 23:16:26 +03:00
2011-09-06 12:55:34 +04:00
if ( level < BTRFS_MAX_LEVEL - 1 ) {
2007-10-16 00:14:19 +04:00
parent = path - > nodes [ level + 1 ] ;
2011-09-06 12:55:34 +04:00
pslot = path - > slots [ level + 1 ] ;
}
2007-03-01 20:04:21 +03:00
2007-03-17 21:29:23 +03:00
/*
* deal with the case where there is only one pointer in the root
* by promoting the node below to a root
*/
2007-10-16 00:14:19 +04:00
if ( ! parent ) {
struct extent_buffer * child ;
2007-03-01 20:04:21 +03:00
2007-10-16 00:14:19 +04:00
if ( btrfs_header_nritems ( mid ) ! = 1 )
2007-03-01 20:04:21 +03:00
return 0 ;
/* promote the child to a root */
2019-08-21 20:16:27 +03:00
child = btrfs_read_node_slot ( mid , 0 ) ;
2016-07-05 22:10:14 +03:00
if ( IS_ERR ( child ) ) {
ret = PTR_ERR ( child ) ;
2023-06-08 13:27:42 +03:00
goto out ;
2011-09-01 22:27:57 +04:00
}
2008-06-26 00:01:30 +04:00
btrfs_tree_lock ( child ) ;
2020-08-20 18:46:03 +03:00
ret = btrfs_cow_block ( trans , root , child , mid , 0 , & child ,
BTRFS_NESTING_COW ) ;
2010-05-16 18:46:25 +04:00
if ( ret ) {
btrfs_tree_unlock ( child ) ;
free_extent_buffer ( child ) ;
2023-06-08 13:27:42 +03:00
goto out ;
2010-05-16 18:46:25 +04:00
}
2008-02-01 22:58:07 +03:00
2021-03-11 17:31:08 +03:00
ret = btrfs_tree_mod_log_insert_root ( root - > node , child , true ) ;
2023-06-08 13:27:41 +03:00
if ( ret < 0 ) {
btrfs_tree_unlock ( child ) ;
free_extent_buffer ( child ) ;
btrfs_abort_transaction ( trans , ret ) ;
2023-06-08 13:27:42 +03:00
goto out ;
2023-06-08 13:27:41 +03:00
}
2011-03-23 21:54:42 +03:00
rcu_assign_pointer ( root - > node , child ) ;
2008-06-26 00:01:30 +04:00
2008-03-24 22:01:56 +03:00
add_root_to_dirty_list ( root ) ;
2008-06-26 00:01:30 +04:00
btrfs_tree_unlock ( child ) ;
2009-02-04 17:25:08 +03:00
2008-06-26 00:01:30 +04:00
path - > locks [ level ] = 0 ;
2007-03-01 20:04:21 +03:00
path - > nodes [ level ] = NULL ;
2023-01-27 00:00:58 +03:00
btrfs_clear_buffer_dirty ( trans , mid ) ;
2008-06-26 00:01:30 +04:00
btrfs_tree_unlock ( mid ) ;
2007-03-01 20:04:21 +03:00
/* once for the path */
2007-10-16 00:14:19 +04:00
free_extent_buffer ( mid ) ;
2010-05-16 18:46:25 +04:00
2023-09-08 02:09:33 +03:00
root_sub_used_bytes ( root ) ;
2021-12-13 11:45:12 +03:00
btrfs_free_tree_block ( trans , btrfs_root_id ( root ) , mid , 0 , 1 ) ;
2007-03-01 20:04:21 +03:00
/* once for the root ptr */
2012-03-10 01:01:49 +04:00
free_extent_buffer_stale ( mid ) ;
2010-05-16 18:46:25 +04:00
return 0 ;
2007-03-01 20:04:21 +03:00
}
2007-10-16 00:14:19 +04:00
if ( btrfs_header_nritems ( mid ) >
2016-06-23 01:54:23 +03:00
BTRFS_NODEPTRS_PER_BLOCK ( fs_info ) / 4 )
2007-03-01 20:04:21 +03:00
return 0 ;
2023-02-07 19:57:21 +03:00
if ( pslot ) {
left = btrfs_read_node_slot ( parent , pslot - 1 ) ;
if ( IS_ERR ( left ) ) {
ret = PTR_ERR ( left ) ;
left = NULL ;
2023-06-08 13:27:42 +03:00
goto out ;
2023-02-07 19:57:21 +03:00
}
2016-07-05 22:10:14 +03:00
2020-08-20 18:46:04 +03:00
__btrfs_tree_lock ( left , BTRFS_NESTING_LEFT ) ;
2007-10-16 00:14:19 +04:00
wret = btrfs_cow_block ( trans , root , left ,
2020-08-20 18:46:03 +03:00
parent , pslot - 1 , & left ,
2020-08-20 18:46:05 +03:00
BTRFS_NESTING_LEFT_COW ) ;
2007-06-22 22:16:25 +04:00
if ( wret ) {
ret = wret ;
2023-06-08 13:27:42 +03:00
goto out ;
2007-06-22 22:16:25 +04:00
}
2007-08-28 00:49:44 +04:00
}
2016-07-05 22:10:14 +03:00
2023-02-07 19:57:21 +03:00
if ( pslot + 1 < btrfs_header_nritems ( parent ) ) {
right = btrfs_read_node_slot ( parent , pslot + 1 ) ;
if ( IS_ERR ( right ) ) {
ret = PTR_ERR ( right ) ;
right = NULL ;
2023-06-08 13:27:42 +03:00
goto out ;
2023-02-07 19:57:21 +03:00
}
2016-07-05 22:10:14 +03:00
2020-08-20 18:46:04 +03:00
__btrfs_tree_lock ( right , BTRFS_NESTING_RIGHT ) ;
2007-10-16 00:14:19 +04:00
wret = btrfs_cow_block ( trans , root , right ,
2020-08-20 18:46:03 +03:00
parent , pslot + 1 , & right ,
2020-08-20 18:46:05 +03:00
BTRFS_NESTING_RIGHT_COW ) ;
2007-08-28 00:49:44 +04:00
if ( wret ) {
ret = wret ;
2023-06-08 13:27:42 +03:00
goto out ;
2007-08-28 00:49:44 +04:00
}
}
/* first, try to make some room in the middle buffer */
2007-10-16 00:14:19 +04:00
if ( left ) {
orig_slot + = btrfs_header_nritems ( left ) ;
2019-03-20 16:16:45 +03:00
wret = push_node_left ( trans , left , mid , 1 ) ;
2007-03-01 23:16:26 +03:00
if ( wret < 0 )
ret = wret ;
2007-03-01 20:04:21 +03:00
}
2007-03-01 23:16:26 +03:00
/*
* then try to empty the right most buffer into the middle
*/
2007-10-16 00:14:19 +04:00
if ( right ) {
2019-03-20 16:16:45 +03:00
wret = push_node_left ( trans , mid , right , 1 ) ;
2007-06-22 22:16:25 +04:00
if ( wret < 0 & & wret ! = - ENOSPC )
2007-03-01 23:16:26 +03:00
ret = wret ;
2007-10-16 00:14:19 +04:00
if ( btrfs_header_nritems ( right ) = = 0 ) {
2023-01-27 00:00:58 +03:00
btrfs_clear_buffer_dirty ( trans , right ) ;
2008-06-26 00:01:30 +04:00
btrfs_tree_unlock ( right ) ;
2023-06-08 13:27:49 +03:00
ret = btrfs_del_ptr ( trans , root , path , level + 1 , pslot + 1 ) ;
if ( ret < 0 ) {
free_extent_buffer_stale ( right ) ;
right = NULL ;
goto out ;
}
2023-09-08 02:09:33 +03:00
root_sub_used_bytes ( root ) ;
2021-12-13 11:45:12 +03:00
btrfs_free_tree_block ( trans , btrfs_root_id ( root ) , right ,
0 , 1 ) ;
2012-03-10 01:01:49 +04:00
free_extent_buffer_stale ( right ) ;
2010-05-16 18:46:25 +04:00
right = NULL ;
2007-03-01 20:04:21 +03:00
} else {
2007-10-16 00:14:19 +04:00
struct btrfs_disk_key right_key ;
btrfs_node_key ( right , & right_key , 0 ) ;
2021-03-11 17:31:07 +03:00
ret = btrfs_tree_mod_log_insert_key ( parent , pslot + 1 ,
2022-10-14 16:44:33 +03:00
BTRFS_MOD_LOG_KEY_REPLACE ) ;
2023-06-08 13:27:41 +03:00
if ( ret < 0 ) {
btrfs_abort_transaction ( trans , ret ) ;
2023-06-08 13:27:42 +03:00
goto out ;
2023-06-08 13:27:41 +03:00
}
2007-10-16 00:14:19 +04:00
btrfs_set_node_key ( parent , & right_key , pslot + 1 ) ;
btrfs_mark_buffer_dirty ( parent ) ;
2007-03-01 20:04:21 +03:00
}
}
2007-10-16 00:14:19 +04:00
if ( btrfs_header_nritems ( mid ) = = 1 ) {
2007-03-01 23:16:26 +03:00
/*
 * We're not allowed to leave a node with one item in the
 * tree during a delete. A deletion from lower in the tree
 * could try to delete the only pointer in this node.
 * So, pull some keys from the left.
 * There has to be a left pointer at this point because
 * otherwise we would have pulled some pointers from the
 * right.
*/
2023-06-08 13:27:44 +03:00
if ( unlikely ( ! left ) ) {
btrfs_crit ( fs_info ,
" missing left child when middle child only has 1 item, parent bytenr %llu level %d mid bytenr %llu root %llu " ,
parent - > start , btrfs_header_level ( parent ) ,
mid - > start , btrfs_root_id ( root ) ) ;
ret = - EUCLEAN ;
btrfs_abort_transaction ( trans , ret ) ;
2023-06-08 13:27:42 +03:00
goto out ;
2011-09-01 22:27:57 +04:00
}
2019-03-20 16:18:06 +03:00
wret = balance_node_right ( trans , mid , left ) ;
2007-06-22 22:16:25 +04:00
if ( wret < 0 ) {
2007-03-01 23:16:26 +03:00
ret = wret ;
2023-06-08 13:27:42 +03:00
goto out ;
2007-06-22 22:16:25 +04:00
}
2008-04-24 22:42:46 +04:00
if ( wret = = 1 ) {
2019-03-20 16:16:45 +03:00
wret = push_node_left ( trans , left , mid , 1 ) ;
2008-04-24 22:42:46 +04:00
if ( wret < 0 )
ret = wret ;
}
2007-03-01 23:16:26 +03:00
BUG_ON ( wret = = 1 ) ;
}
2007-10-16 00:14:19 +04:00
if ( btrfs_header_nritems ( mid ) = = 0 ) {
2023-01-27 00:00:58 +03:00
btrfs_clear_buffer_dirty ( trans , mid ) ;
2008-06-26 00:01:30 +04:00
btrfs_tree_unlock ( mid ) ;
2023-06-08 13:27:49 +03:00
ret = btrfs_del_ptr ( trans , root , path , level + 1 , pslot ) ;
if ( ret < 0 ) {
free_extent_buffer_stale ( mid ) ;
mid = NULL ;
goto out ;
}
2023-09-08 02:09:33 +03:00
root_sub_used_bytes ( root ) ;
2021-12-13 11:45:12 +03:00
btrfs_free_tree_block ( trans , btrfs_root_id ( root ) , mid , 0 , 1 ) ;
2012-03-10 01:01:49 +04:00
free_extent_buffer_stale ( mid ) ;
2010-05-16 18:46:25 +04:00
mid = NULL ;
2007-03-01 23:16:26 +03:00
} else {
/* update the parent key to reflect our changes */
2007-10-16 00:14:19 +04:00
struct btrfs_disk_key mid_key ;
btrfs_node_key ( mid , & mid_key , 0 ) ;
2021-03-11 17:31:07 +03:00
ret = btrfs_tree_mod_log_insert_key ( parent , pslot ,
2022-10-14 16:44:33 +03:00
BTRFS_MOD_LOG_KEY_REPLACE ) ;
2023-06-08 13:27:41 +03:00
if ( ret < 0 ) {
btrfs_abort_transaction ( trans , ret ) ;
2023-06-08 13:27:42 +03:00
goto out ;
2023-06-08 13:27:41 +03:00
}
2007-10-16 00:14:19 +04:00
btrfs_set_node_key ( parent , & mid_key , pslot ) ;
btrfs_mark_buffer_dirty ( parent ) ;
2007-03-01 23:16:26 +03:00
}
2007-03-01 20:04:21 +03:00
2007-03-01 23:16:26 +03:00
/* update the path */
2007-10-16 00:14:19 +04:00
if ( left ) {
if ( btrfs_header_nritems ( left ) > orig_slot ) {
2019-10-08 14:28:47 +03:00
atomic_inc ( & left - > refs ) ;
2008-06-26 00:01:30 +04:00
/* left was locked after cow */
2007-10-16 00:14:19 +04:00
path - > nodes [ level ] = left ;
2007-03-01 20:04:21 +03:00
path - > slots [ level + 1 ] - = 1 ;
path - > slots [ level ] = orig_slot ;
2008-06-26 00:01:30 +04:00
if ( mid ) {
btrfs_tree_unlock ( mid ) ;
2007-10-16 00:14:19 +04:00
free_extent_buffer ( mid ) ;
2008-06-26 00:01:30 +04:00
}
2007-03-01 20:04:21 +03:00
} else {
2007-10-16 00:14:19 +04:00
orig_slot - = btrfs_header_nritems ( left ) ;
2007-03-01 20:04:21 +03:00
path - > slots [ level ] = orig_slot ;
}
}
2007-03-01 23:16:26 +03:00
/* double check we haven't messed things up */
2007-03-22 19:13:20 +03:00
if ( orig_ptr ! =
2007-10-16 00:14:19 +04:00
btrfs_node_blockptr ( path - > nodes [ level ] , path - > slots [ level ] ) )
2007-03-01 23:16:26 +03:00
BUG ( ) ;
2023-06-08 13:27:42 +03:00
out :
2008-06-26 00:01:30 +04:00
if ( right ) {
btrfs_tree_unlock ( right ) ;
2007-10-16 00:14:19 +04:00
free_extent_buffer ( right ) ;
2008-06-26 00:01:30 +04:00
}
if ( left ) {
if ( path - > nodes [ level ] ! = left )
btrfs_tree_unlock ( left ) ;
2007-10-16 00:14:19 +04:00
free_extent_buffer ( left ) ;
2008-06-26 00:01:30 +04:00
}
2007-03-01 20:04:21 +03:00
return ret ;
}
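Two pieces of arithmetic in balance_level() are easy to misread: the occupancy threshold that decides whether a non-root node is balanced at all, and the remapping of orig_slot once pointers have moved between the left sibling and the middle node. The sketch below models both on plain integers. It assumes nothing beyond what the function above shows; the `sketch_` names and the `sketch_pos` struct are hypothetical.

```c
#include <assert.h>

/* Hypothetical result type: which node holds the original pointer now. */
struct sketch_pos {
	int in_left;	/* 1 if the pointer ended up in the left sibling */
	int slot;	/* slot index inside that node */
};

/* A non-root node is left alone while it holds more than capacity / 4 pointers. */
static int sketch_needs_balance(unsigned int nritems, unsigned int capacity)
{
	return nritems <= capacity / 4;
}

/*
 * orig_slot is first turned into an index into the concatenation
 * "left followed by mid" (orig_slot += nritems(left) before any push).
 * After pointers have moved, the index is mapped back to a node and slot.
 */
static struct sketch_pos sketch_remap_slot(int left_before, int left_after,
					   int orig_slot)
{
	struct sketch_pos pos;
	int combined;

	assert(left_before >= 0 && left_after >= 0 && orig_slot >= 0);
	combined = orig_slot + left_before;
	if (left_after > combined) {
		/* The pointer was pushed into the left sibling. */
		pos.in_left = 1;
		pos.slot = combined;
	} else {
		/* The pointer is still in mid, shifted by what left now holds. */
		pos.in_left = 0;
		pos.slot = combined - left_after;
	}
	return pos;
}
```

Treating orig_slot as an index into the combined sequence is what lets the same check work whether pointers moved into the left sibling or back out of it.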
2008-09-29 23:18:18 +04:00
/*
 * Node balancing for insertion. Here we only split or push nodes around
 * when they are completely full. This is also done top down, so we
 * have to be pessimistic.
 */
2009-01-06 05:25:51 +03:00
static noinline int push_nodes_for_insert ( struct btrfs_trans_handle * trans ,
2008-01-03 18:01:48 +03:00
struct btrfs_root * root ,
struct btrfs_path * path , int level )
2007-04-20 21:16:02 +04:00
{
2016-06-23 01:54:23 +03:00
struct btrfs_fs_info * fs_info = root - > fs_info ;
2007-10-16 00:14:19 +04:00
struct extent_buffer * right = NULL ;
struct extent_buffer * mid ;
struct extent_buffer * left = NULL ;
struct extent_buffer * parent = NULL ;
2007-04-20 21:16:02 +04:00
int ret = 0 ;
int wret ;
int pslot ;
int orig_slot = path - > slots [ level ] ;
if ( level = = 0 )
return 1 ;
2007-10-16 00:14:19 +04:00
mid = path - > nodes [ level ] ;
2007-12-11 17:25:06 +03:00
WARN_ON ( btrfs_header_generation ( mid ) ! = trans - > transid ) ;
2007-04-20 21:16:02 +04:00
2011-09-06 12:55:34 +04:00
if ( level < BTRFS_MAX_LEVEL - 1 ) {
2007-10-16 00:14:19 +04:00
parent = path - > nodes [ level + 1 ] ;
2011-09-06 12:55:34 +04:00
pslot = path - > slots [ level + 1 ] ;
}
2007-04-20 21:16:02 +04:00
2007-10-16 00:14:19 +04:00
if ( ! parent )
2007-04-20 21:16:02 +04:00
return 1 ;
/* first, try to make some room in the middle buffer */
2023-02-07 19:57:21 +03:00
if ( pslot ) {
2007-04-20 21:16:02 +04:00
u32 left_nr ;
2008-06-26 00:01:30 +04:00
2023-02-07 19:57:21 +03:00
left = btrfs_read_node_slot ( parent , pslot - 1 ) ;
if ( IS_ERR ( left ) )
return PTR_ERR ( left ) ;
2020-08-20 18:46:04 +03:00
__btrfs_tree_lock ( left , BTRFS_NESTING_LEFT ) ;
2009-02-04 17:25:08 +03:00
2007-10-16 00:14:19 +04:00
left_nr = btrfs_header_nritems ( left ) ;
2016-06-23 01:54:23 +03:00
if ( left_nr > = BTRFS_NODEPTRS_PER_BLOCK ( fs_info ) - 1 ) {
2007-04-20 21:48:57 +04:00
wret = 1 ;
} else {
2007-10-16 00:14:19 +04:00
ret = btrfs_cow_block ( trans , root , left , parent ,
2020-08-20 18:46:03 +03:00
pslot - 1 , & left ,
2020-08-20 18:46:05 +03:00
BTRFS_NESTING_LEFT_COW ) ;
2007-06-22 22:16:25 +04:00
if ( ret )
wret = 1 ;
else {
2019-03-20 16:16:45 +03:00
wret = push_node_left ( trans , left , mid , 0 ) ;
2007-06-22 22:16:25 +04:00
}
2007-04-20 21:48:57 +04:00
}
2007-04-20 21:16:02 +04:00
if ( wret < 0 )
ret = wret ;
if ( wret = = 0 ) {
2007-10-16 00:14:19 +04:00
struct btrfs_disk_key disk_key ;
2007-04-20 21:16:02 +04:00
orig_slot + = left_nr ;
2007-10-16 00:14:19 +04:00
btrfs_node_key ( mid , & disk_key , 0 ) ;
2021-03-11 17:31:07 +03:00
ret = btrfs_tree_mod_log_insert_key ( parent , pslot ,
2022-10-14 16:44:33 +03:00
BTRFS_MOD_LOG_KEY_REPLACE ) ;
2023-06-08 13:27:46 +03:00
if ( ret < 0 ) {
btrfs_tree_unlock ( left ) ;
free_extent_buffer ( left ) ;
btrfs_abort_transaction ( trans , ret ) ;
return ret ;
}
2007-10-16 00:14:19 +04:00
btrfs_set_node_key ( parent , & disk_key , pslot ) ;
btrfs_mark_buffer_dirty ( parent ) ;
if ( btrfs_header_nritems ( left ) > orig_slot ) {
path - > nodes [ level ] = left ;
2007-04-20 21:16:02 +04:00
path - > slots [ level + 1 ] - = 1 ;
path - > slots [ level ] = orig_slot ;
2008-06-26 00:01:30 +04:00
btrfs_tree_unlock ( mid ) ;
2007-10-16 00:14:19 +04:00
free_extent_buffer ( mid ) ;
2007-04-20 21:16:02 +04:00
} else {
orig_slot - =
2007-10-16 00:14:19 +04:00
btrfs_header_nritems ( left ) ;
2007-04-20 21:16:02 +04:00
path - > slots [ level ] = orig_slot ;
2008-06-26 00:01:30 +04:00
btrfs_tree_unlock ( left ) ;
2007-10-16 00:14:19 +04:00
free_extent_buffer ( left ) ;
2007-04-20 21:16:02 +04:00
}
return 0 ;
}
2008-06-26 00:01:30 +04:00
btrfs_tree_unlock ( left ) ;
2007-10-16 00:14:19 +04:00
free_extent_buffer ( left ) ;
2007-04-20 21:16:02 +04:00
}
/*
* then try to empty the right most buffer into the middle
*/
2023-02-07 19:57:21 +03:00
if ( pslot + 1 < btrfs_header_nritems ( parent ) ) {
2007-04-20 21:48:57 +04:00
u32 right_nr ;
2009-02-04 17:25:08 +03:00
2023-02-07 19:57:21 +03:00
right = btrfs_read_node_slot ( parent , pslot + 1 ) ;
if ( IS_ERR ( right ) )
return PTR_ERR ( right ) ;
2020-08-20 18:46:04 +03:00
__btrfs_tree_lock ( right , BTRFS_NESTING_RIGHT ) ;
2009-02-04 17:25:08 +03:00
2007-10-16 00:14:19 +04:00
right_nr = btrfs_header_nritems ( right ) ;
2016-06-23 01:54:23 +03:00
if ( right_nr > = BTRFS_NODEPTRS_PER_BLOCK ( fs_info ) - 1 ) {
2007-04-20 21:48:57 +04:00
wret = 1 ;
} else {
2007-10-16 00:14:19 +04:00
ret = btrfs_cow_block ( trans , root , right ,
parent , pslot + 1 ,
2020-08-20 18:46:05 +03:00
& right , BTRFS_NESTING_RIGHT_COW ) ;
2007-06-22 22:16:25 +04:00
if ( ret )
wret = 1 ;
else {
2019-03-20 16:18:06 +03:00
wret = balance_node_right ( trans , right , mid ) ;
2007-06-22 22:16:25 +04:00
}
2007-04-20 21:48:57 +04:00
}
2007-04-20 21:16:02 +04:00
if ( wret < 0 )
ret = wret ;
if ( wret = = 0 ) {
2007-10-16 00:14:19 +04:00
struct btrfs_disk_key disk_key ;
btrfs_node_key ( right , & disk_key , 0 ) ;
2021-03-11 17:31:07 +03:00
ret = btrfs_tree_mod_log_insert_key ( parent , pslot + 1 ,
2022-10-14 16:44:33 +03:00
BTRFS_MOD_LOG_KEY_REPLACE ) ;
2023-06-08 13:27:46 +03:00
if ( ret < 0 ) {
btrfs_tree_unlock ( right ) ;
free_extent_buffer ( right ) ;
btrfs_abort_transaction ( trans , ret ) ;
return ret ;
}
2007-10-16 00:14:19 +04:00
btrfs_set_node_key ( parent , & disk_key , pslot + 1 ) ;
btrfs_mark_buffer_dirty ( parent ) ;
if ( btrfs_header_nritems ( mid ) < = orig_slot ) {
path - > nodes [ level ] = right ;
2007-04-20 21:16:02 +04:00
path - > slots [ level + 1 ] + = 1 ;
path - > slots [ level ] = orig_slot -
2007-10-16 00:14:19 +04:00
btrfs_header_nritems ( mid ) ;
2008-06-26 00:01:30 +04:00
btrfs_tree_unlock ( mid ) ;
2007-10-16 00:14:19 +04:00
free_extent_buffer ( mid ) ;
2007-04-20 21:16:02 +04:00
} else {
2008-06-26 00:01:30 +04:00
btrfs_tree_unlock ( right ) ;
2007-10-16 00:14:19 +04:00
free_extent_buffer ( right ) ;
2007-04-20 21:16:02 +04:00
}
return 0 ;
}
2008-06-26 00:01:30 +04:00
btrfs_tree_unlock ( right ) ;
2007-10-16 00:14:19 +04:00
free_extent_buffer ( right ) ;
2007-04-20 21:16:02 +04:00
}
return 1 ;
}
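push_nodes_for_insert() skips a sibling that is nearly full and, after pushing pointers to the right, decides whether the path should follow them. A minimal model of those two checks, using plain integers and hypothetical `sketch_` names, might look like this:

```c
#include <assert.h>

/* Hypothetical result type: where the tracked pointer lives after the push. */
struct sketch_where {
	int in_right;	/* 1 if the path must move to the right sibling */
	int slot;	/* slot index inside the chosen node */
};

/* A sibling is only used when it holds fewer than capacity - 1 pointers. */
static int sketch_sibling_has_room(unsigned int sibling_nr,
				   unsigned int capacity)
{
	assert(capacity > 1);
	return sibling_nr < capacity - 1;
}

/*
 * After the tail of mid has been moved to the front of right, a slot at
 * or beyond the new size of mid belongs to the right sibling.
 */
static struct sketch_where sketch_remap_after_push_right(int mid_after,
							 int orig_slot)
{
	struct sketch_where w;

	assert(mid_after >= 0 && orig_slot >= 0);
	if (mid_after <= orig_slot) {
		w.in_right = 1;
		w.slot = orig_slot - mid_after;
	} else {
		w.in_right = 0;
		w.slot = orig_slot;
	}
	return w;
}
```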
2007-08-07 23:52:22 +04:00
/*
2008-09-29 23:18:18 +04:00
 * Read ahead one full node of leaves: find blocks that are close
 * to the block in 'slot' and trigger readahead on them.
2007-08-07 23:52:22 +04:00
*/
2016-06-23 01:54:24 +03:00
static void reada_for_search ( struct btrfs_fs_info * fs_info ,
2009-04-03 18:14:18 +04:00
struct btrfs_path * path ,
int level , int slot , u64 objectid )
2007-08-07 23:52:22 +04:00
{
2007-10-16 00:14:19 +04:00
struct extent_buffer * node ;
2007-12-22 00:24:26 +03:00
struct btrfs_disk_key disk_key ;
2007-08-07 23:52:22 +04:00
u32 nritems ;
u64 search ;
2009-01-22 17:23:10 +03:00
u64 target ;
2007-10-16 00:17:34 +04:00
u64 nread = 0 ;
2021-03-31 13:56:21 +03:00
u64 nread_max ;
2007-10-16 00:17:34 +04:00
u32 nr ;
u32 blocksize ;
u32 nscan = 0 ;
2007-10-16 00:15:53 +04:00
2021-03-31 13:56:21 +03:00
if ( level ! = 1 & & path - > reada ! = READA_FORWARD_ALWAYS )
2007-08-08 00:15:09 +04:00
return ;
if ( ! path - > nodes [ level ] )
2007-08-07 23:52:22 +04:00
return ;
2007-10-16 00:14:19 +04:00
node = path - > nodes [ level ] ;
2008-06-26 00:01:30 +04:00
2021-03-31 13:56:21 +03:00
/*
 * Since the time between visiting leaves is much shorter than the time
 * between visiting nodes, limit read ahead of nodes to 1, to avoid too
 * much IO at once (possibly random).
*/
if ( path - > reada = = READA_FORWARD_ALWAYS ) {
if ( level > 1 )
nread_max = node - > fs_info - > nodesize ;
else
nread_max = SZ_128K ;
} else {
nread_max = SZ_64K ;
}
2007-08-07 23:52:22 +04:00
search = btrfs_node_blockptr ( node , slot ) ;
2016-06-23 01:54:23 +03:00
blocksize = fs_info - > nodesize ;
btrfs: continue readahead of siblings even if target node is in memory
At reada_for_search(), when attempting to readahead a node or leaf's
siblings, we skip the readahead of the siblings if the node/leaf is
already in memory. That is probably fine for the READA_FORWARD and
READA_BACK readahead types, as they are used on contexts where we
end up reading some consecutive leaves, but usually not the whole btree.
However for a READA_FORWARD_ALWAYS mode, currently only used for full
send operations, it does not make sense to skip the readahead if the
target node or leaf is already loaded in memory, since we know the caller
is visiting every node and leaf of the btree in ascending order.
So change the behaviour to not skip the readahead when the target node is
already in memory and the readahead mode is READA_FORWARD_ALWAYS.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk, with 32GiB of RAM and
using a non-debug kernel config (Debian's default config).
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
file_count=2000000
add_files $file_count 0 4
echo
echo "Creating snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
The duration of the full send operations, in seconds, were the following:
Before this change: 85 seconds
After this change: 76 seconds (-11.2%)
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-20 18:03:03 +03:00
if ( path - > reada ! = READA_FORWARD_ALWAYS ) {
struct extent_buffer * eb ;
eb = find_extent_buffer ( fs_info , search ) ;
if ( eb ) {
free_extent_buffer ( eb ) ;
return ;
}
2007-08-07 23:52:22 +04:00
}
2009-01-22 17:23:10 +03:00
target = search ;
2007-10-16 00:17:34 +04:00
2007-10-16 00:14:19 +04:00
nritems = btrfs_header_nritems ( node ) ;
2007-10-16 00:17:34 +04:00
nr = slot ;
2011-06-08 22:36:54 +04:00
2009-01-06 05:25:51 +03:00
while ( 1 ) {
2015-11-27 18:31:35 +03:00
if ( path - > reada = = READA_BACK ) {
2007-10-16 00:17:34 +04:00
if ( nr = = 0 )
break ;
nr - - ;
2021-03-31 13:56:21 +03:00
} else if ( path - > reada = = READA_FORWARD | |
path - > reada = = READA_FORWARD_ALWAYS ) {
2007-10-16 00:17:34 +04:00
nr + + ;
if ( nr > = nritems )
break ;
2007-08-07 23:52:22 +04:00
}
2015-11-27 18:31:35 +03:00
if ( path - > reada = = READA_BACK & & objectid ) {
2007-12-22 00:24:26 +03:00
btrfs_node_key ( node , & disk_key , nr ) ;
if ( btrfs_disk_key_objectid ( & disk_key ) ! = objectid )
break ;
}
2007-10-16 00:17:34 +04:00
search = btrfs_node_blockptr ( node , nr ) ;
2021-03-31 13:56:21 +03:00
if ( path - > reada = = READA_FORWARD_ALWAYS | |
( search < = target & & target - search < = 65536 ) | |
2009-01-22 17:23:10 +03:00
( search > target & & search - target < = 65536 ) ) {
2020-11-05 18:45:09 +03:00
btrfs_readahead_node_child ( node , nr ) ;
2007-10-16 00:17:34 +04:00
nread + = blocksize ;
}
nscan + + ;
2021-03-31 13:56:21 +03:00
if ( nread > nread_max | | nscan > 32 )
2007-10-16 00:17:34 +04:00
break ;
2007-08-07 23:52:22 +04:00
}
}
2008-06-26 00:01:30 +04:00
2020-11-05 18:45:09 +03:00
static noinline void reada_for_balance ( struct btrfs_path * path , int level )
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able to get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 17:25:08 +03:00
{
2020-11-05 18:45:09 +03:00
struct extent_buffer *parent;
2009-02-04 17:25:08 +03:00
int slot;
int nritems;
2009-04-20 23:50:10 +04:00
parent = path->nodes[level + 1];
2009-02-04 17:25:08 +03:00
if (!parent)
2013-06-17 22:23:02 +04:00
return;
2009-02-04 17:25:08 +03:00
nritems = btrfs_header_nritems(parent);
2009-04-20 23:50:10 +04:00
slot = path->slots[level + 1];
2009-02-04 17:25:08 +03:00
2020-11-05 18:45:09 +03:00
if (slot > 0)
btrfs_readahead_node_child(parent, slot - 1);
if (slot + 1 < nritems)
btrfs_readahead_node_child(parent, slot + 1);
2009-02-04 17:25:08 +03:00
}
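The sibling selection in reada_for_balance() reduces to two bounds checks on the parent's slot. A small userspace sketch of that selection follows; `sibling_slots()` is a hypothetical helper written for illustration and does no I/O, it only reports which slots the function above would hand to btrfs_readahead_node_child().

```c
#include <assert.h>

/*
 * Hypothetical helper: fill 'out' with the sibling slots that would be
 * prefetched for a child at 'slot' in a parent holding 'nritems' pointers.
 * Returns the number of slots written (0, 1 or 2). The left sibling, if any,
 * is reported first, matching the order of the checks above.
 */
static int sibling_slots(int slot, int nritems, int out[2])
{
	int n = 0;

	assert(slot >= 0 && nritems >= 0);
	if (slot > 0)
		out[n++] = slot - 1;
	if (slot + 1 < nritems)
		out[n++] = slot + 1;
	return n;
}
```

A child in the first slot only has a right sibling, a child in the last slot only has a left one, and a parent with a single pointer yields nothing to read ahead.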
2008-09-29 23:18:18 +04:00
/*
2009-01-06 05:25:51 +03:00
* when we walk down the tree, it is usually safe to unlock the higher layers
* in the tree. The exceptions are when our path goes through slot 0, because
* operations on the tree might require changing key pointers higher up in the
* tree.
2008-09-29 23:18:18 +04:00
*
2009-01-06 05:25:51 +03:00
* callers might also have set path->keep_locks, which tells this code to keep
* the lock if the path points to the last slot in the block. This is part of
* walking through the tree, and selecting the next slot in the higher block.
2008-09-29 23:18:18 +04:00
*
2009-01-06 05:25:51 +03:00
* lowest_unlock sets the lowest level in the tree we're allowed to unlock. so
* if lowest_unlock is 1, level 0 won't be unlocked
2008-09-29 23:18:18 +04:00
*/
2008-09-06 00:13:11 +04:00
static noinline void unlock_up(struct btrfs_path *path, int level,
2012-03-19 23:54:38 +04:00
int lowest_unlock, int min_write_lock_level,
int *write_lock_level)
2008-06-26 00:01:30 +04:00
{
int i;
int skip_level = level;
2021-12-14 16:39:39 +03:00
bool check_skip = true;
2008-06-26 00:01:30 +04:00
for (i = level; i < BTRFS_MAX_LEVEL; i++) {
if (!path->nodes[i])
break;
if (!path->locks[i])
break;
2021-12-14 16:39:39 +03:00
if (check_skip) {
if (path->slots[i] == 0) {
2008-06-26 00:01:30 +04:00
skip_level = i + 1;
continue;
}
2021-12-14 16:39:39 +03:00
if (path->keep_locks) {
u32 nritems;
nritems = btrfs_header_nritems(path->nodes[i]);
if (nritems < 1 || path->slots[i] >= nritems - 1) {
skip_level = i + 1;
continue;
}
}
2008-06-26 00:01:30 +04:00
}
2008-06-26 00:01:30 +04:00
2018-05-18 06:00:24 +03:00
if (i >= lowest_unlock && i > skip_level) {
2021-12-14 16:39:39 +03:00
check_skip = false;
btrfs_tree_unlock_rw(path->nodes[i], path->locks[i]);
2008-06-26 00:01:30 +04:00
path->locks[i] = 0;
2012-03-19 23:54:38 +04:00
if (write_lock_level &&
i > min_write_lock_level &&
i <= *write_lock_level) {
*write_lock_level = i - 1;
}
2008-06-26 00:01:30 +04:00
}
}
}
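The slot 0 and keep_locks rules of unlock_up() are easier to follow on a toy model. The sketch below is a userspace approximation under stated assumptions: `struct model_path`, `model_unlock_up()` and `MAX_LEVEL` are hypothetical stand-ins, locks are plain integers rather than real tree locks, and the write_lock_level bookkeeping is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_LEVEL 8	/* stand-in for BTRFS_MAX_LEVEL in this model */

/* Hypothetical, simplified stand-in for struct btrfs_path. */
struct model_path {
	bool present[MAX_LEVEL];	/* models path->nodes[i] != NULL */
	int slots[MAX_LEVEL];
	int nritems[MAX_LEVEL];		/* models btrfs_header_nritems() */
	int locks[MAX_LEVEL];		/* non-zero means the level is locked */
	bool keep_locks;
};

/* Same control flow as unlock_up() above, minus write_lock_level. */
static void model_unlock_up(struct model_path *p, int level, int lowest_unlock)
{
	int skip_level = level;
	bool check_skip = true;

	for (int i = level; i < MAX_LEVEL; i++) {
		if (!p->present[i] || !p->locks[i])
			break;
		if (check_skip) {
			/* Slot 0: key pointers higher up may need updating. */
			if (p->slots[i] == 0) {
				skip_level = i + 1;
				continue;
			}
			/* keep_locks: the last slot keeps the parent locked. */
			if (p->keep_locks &&
			    (p->nritems[i] < 1 ||
			     p->slots[i] >= p->nritems[i] - 1)) {
				skip_level = i + 1;
				continue;
			}
		}
		if (i >= lowest_unlock && i > skip_level) {
			/* Once a level is unlocked, stop extending skip_level. */
			check_skip = false;
			p->locks[i] = 0;
		}
	}
}
```

In this model, a three-level path whose leaf slot is 0 keeps level 1 locked and releases only level 2, while the same path with a leaf slot in the middle of the leaf releases both upper levels when `lowest_unlock` is 1.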
2009-04-03 18:14:18 +04:00
/*
2022-03-11 14:35:33 +03:00
* Helper function for btrfs_search_slot() and other functions that do a search
* on a btree. The goal is to find a tree block in the cache (the radix tree at
* fs_info->buffer_radix), but if we can't find it, or it's not up to date, read
* its pages from disk.
2009-04-03 18:14:18 +04:00
*
2022-03-11 14:35:33 +03:00
* Returns -EAGAIN, with the path unlocked, if the caller needs to repeat the
* whole btree search, starting again from the current root node.
2009-04-03 18:14:18 +04:00
*/
static int
2017-01-30 23:23:42 +03:00
read_block_for_search(struct btrfs_root *root, struct btrfs_path *p,
struct extent_buffer **eb_ret, int level, int slot,
2017-02-10 20:44:32 +03:00
const struct btrfs_key *key)
2009-04-03 18:14:18 +04:00
{
2016-06-23 01:54:23 +03:00
struct btrfs_fs_info *fs_info = root->fs_info;
2022-09-14 08:32:50 +03:00
struct btrfs_tree_parent_check check = { 0 };
2009-04-03 18:14:18 +04:00
u64 blocknr;
u64 gen;
struct extent_buffer *tmp;
2009-05-14 21:24:30 +04:00
int ret;
2018-03-29 04:08:11 +03:00
int parent_level;
btrfs: release upper nodes when reading stale btree node from disk
When reading a btree node (or leaf), at read_block_for_search(), if we
can't find its extent buffer in the cache (the fs_info->buffer_radix
radix tree), then we unlock all upper level nodes before reading the
btree node/leaf from disk, to prevent blocking other tasks for too long.
However if we find that the extent buffer is in the cache but it is not
up to date, we don't unlock upper level nodes before reading it from disk,
potentially blocking other tasks on upper level nodes for too long.
Fix this inconsistent behaviour by unlocking upper level nodes if we need
to read a node/leaf from disk because its in-memory extent buffer is not
up to date. If we unlocked upper level nodes then we must return -EAGAIN
to the caller, just like the case where the extent buffer is not cached in
memory. And like that case, we determine if upper level nodes are locked
by checking only if the parent node is locked - if it isn't, then no other
upper level nodes are locked.
This is actually a rare case, as if we have an extent buffer in memory,
it typically has the uptodate flag set and passes all the checks done by
btrfs_buffer_uptodate().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-11 14:35:32 +03:00
bool unlock_up;
2009-04-03 18:14:18 +04:00
2022-03-11 14:35:32 +03:00
unlock_up = ((level + 1 < BTRFS_MAX_LEVEL) && p->locks[level + 1]);
2020-05-27 13:10:59 +03:00
blocknr = btrfs_node_blockptr(*eb_ret, slot);
gen = btrfs_node_ptr_generation(*eb_ret, slot);
parent_level = btrfs_header_level(*eb_ret);
2022-09-14 08:32:50 +03:00
btrfs_node_key_to_cpu(*eb_ret, &check.first_key, slot);
check.has_first_key = true;
check.level = parent_level - 1;
check.transid = gen;
check.owner_root = root->root_key.objectid;
2009-04-03 18:14:18 +04:00
2022-03-11 14:35:32 +03:00
/*
* If we need to read an extent buffer from disk and we are holding locks
* on upper level nodes, we unlock all the upper nodes before reading the
* extent buffer, and then return -EAGAIN to the caller as it needs to
* restart the search. We don't release the lock on the current level
* because we need to walk this node to figure out which blocks to read.
*/
2016-06-23 01:54:23 +03:00
tmp = find_extent_buffer(fs_info, blocknr);
2010-10-24 19:01:27 +04:00
if (tmp) {
2021-03-31 13:56:21 +03:00
if (p->reada == READA_FORWARD_ALWAYS)
reada_for_search(fs_info, p, level, slot, key->objectid);
2012-05-06 15:23:47 +04:00
/* first we do an atomic uptodate check */
2013-06-17 21:44:48 +04:00
if (btrfs_buffer_uptodate(tmp, gen, 1) > 0) {
btrfs: Check the first key and level for cached extent buffer
[BUG]
When reading a file from a fuzzed image, kernel can panic like:
BTRFS warning (device loop0): csum failed root 5 ino 270 off 0 csum 0x98f94189 expected csum 0x00000000 mirror 1
assertion failed: !memcmp_extent_buffer(b, &disk_key, offsetof(struct btrfs_leaf, items[0].key), sizeof(disk_key)), file: fs/btrfs/ctree.c, line: 2544
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3500!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:btrfs_search_slot.cold.24+0x61/0x63 [btrfs]
Call Trace:
btrfs_lookup_csum+0x52/0x150 [btrfs]
__btrfs_lookup_bio_sums+0x209/0x640 [btrfs]
btrfs_submit_bio_hook+0x103/0x170 [btrfs]
submit_one_bio+0x59/0x80 [btrfs]
extent_read_full_page+0x58/0x80 [btrfs]
generic_file_read_iter+0x2f6/0x9d0
__vfs_read+0x14d/0x1a0
vfs_read+0x8d/0x140
ksys_read+0x52/0xc0
do_syscall_64+0x60/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe
[CAUSE]
The fuzzed image has a corrupted leaf whose first key doesn't match its
parent:
checksum tree key (CSUM_TREE ROOT_ITEM 0)
node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
fs uuid 3381d111-94a3-4ac7-8f39-611bbbdab7e6
chunk uuid 9af1c3c7-2af5-488b-8553-530bd515f14c
...
key (EXTENT_CSUM EXTENT_CSUM 79691776) block 29761536 gen 19
leaf 29761536 items 1 free space 1726 generation 19 owner CSUM_TREE
leaf 29761536 flags 0x1(WRITTEN) backref revision 1
fs uuid 3381d111-94a3-4ac7-8f39-611bbbdab7e6
chunk uuid 9af1c3c7-2af5-488b-8553-530bd515f14c
item 0 key (EXTENT_CSUM EXTENT_CSUM 8798638964736) itemoff 1751 itemsize 2244
range start 8798638964736 end 8798641262592 length 2297856
When reading the above tree block, we have extent_buffer->refs = 2 in
the context:
- initial one from __alloc_extent_buffer()
alloc_extent_buffer()
|- __alloc_extent_buffer()
|- atomic_set(&eb->refs, 1)
- one being added to fs_info->buffer_radix
alloc_extent_buffer()
|- check_buffer_tree_ref()
|- atomic_inc(&eb->refs)
So even if we call free_extent_buffer() in read_tree_block() or a
similar situation, we only decrease the refs by 1; it doesn't reach 0,
so the buffer won't be freed right away.
The stale eb and its corrupted content will still be kept cached.
Furthermore, we have several extra cases where we either don't do first
key check or the check is not proper for all callers:
- scrub
We just don't have first key in this context.
- shared tree block
One tree block can be shared by several snapshot/subvolume trees.
In that case, the first key check for one subvolume doesn't apply to
another.
So for the above reasons, a corrupted extent buffer can sneak into the
buffer cache.
[FIX]
Call verify_level_key in read_block_for_search to do another
verification. For that purpose the function is exported.
Due to above reasons, although we can free corrupted extent buffer from
cache, we still need the check in read_block_for_search(), for scrub and
shared tree blocks.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202755
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202757
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202759
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202761
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202767
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202769
Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-03-12 12:10:40 +03:00
/*
* Do extra check for first_key, eb can be stale due to
* being cached, read from scrub, or have multiple
* parents (shared tree blocks).
*/
2019-03-20 16:58:13 +03:00
if (btrfs_verify_level_key(tmp,
2022-09-14 08:32:50 +03:00
parent_level - 1, &check.first_key, gen)) {
2019-03-12 12:10:40 +03:00
free_extent_buffer(tmp);
return -EUCLEAN;
}
2013-06-17 21:44:48 +04:00
*eb_ret = tmp;
return 0;
}
2022-09-12 22:27:42 +03:00
if (p->nowait) {
free_extent_buffer(tmp);
return -EAGAIN;
}
2022-03-11 14:35:32 +03:00
if (unlock_up)
btrfs_unlock_up_safe(p, level + 1);
2013-06-17 21:44:48 +04:00
/* now we're allowed to do a blocking uptodate check */
2022-09-14 08:32:50 +03:00
ret = btrfs_read_extent_buffer(tmp, &check);
2022-02-22 10:41:20 +03:00
if (ret) {
free_extent_buffer(tmp);
btrfs_release_path(p);
return -EIO;
2010-10-24 19:01:27 +04:00
}
btrfs: tree-checker: check extent buffer owner against owner rootid
Btrfs doesn't check whether the tree block respects the root owner.
This means that if a tree block is referred to by a parent in the extent
tree but has an owner of 5, btrfs can still continue reading the tree
block, as long as it doesn't trigger other sanity checks.
Normally this is fine, but combined with the empty tree check in
check_leaf(), if we hit an empty extent tree, but the root node has
the csum tree as owner, we can let such an extent buffer sneak in.
Shrink the hole by:
- Doing an extra eb owner check at tree read time
- Making sure the owner of the root extent buffer exactly matches the root id.
Unfortunately we can't yet completely patch the hole, as there are several
call sites that can't pass all the info we need:
- For reloc/log trees
Their owner is key::offset, not key::objectid.
We need the full root key to do that accurate check.
For now, we just skip the ownership check for those trees.
- For add_data_references() of relocation
That call site doesn't have any parent/ownership info, as the
bytenrs all come from btrfs_find_all_leafs().
- For direct backref items walk
Direct backref items record the parent bytenr directly, thus unlike
indirect backref items, we don't do a full tree search.
Thus in that case, we don't have the full parent owner to check.
For the latter two cases, they all pass 0 as @owner_root, thus we can
skip those cases if @owner_root is 0.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-16 03:05:58 +03:00
if (btrfs_check_eb_owner(tmp, root->root_key.objectid)) {
free_extent_buffer(tmp);
btrfs_release_path(p);
return -EUCLEAN;
}
2022-03-11 14:35:32 +03:00
if ( unlock_up )
ret = - EAGAIN ;
goto out ;
2022-09-12 22:27:42 +03:00
} else if ( p - > nowait ) {
return - EAGAIN ;
2009-04-03 18:14:18 +04:00
}
2022-03-11 14:35:32 +03:00
if ( unlock_up ) {
btrfs: avoid unnecessary btree search restarts when reading node
When reading a btree node, at read_block_for_search(), if we don't find
the node's (or leaf) extent buffer in the cache, we will read it from
disk. Since that requires waiting on IO, we release all upper level nodes
from our path before reading the target node/leaf, and then return -EAGAIN
to the caller, which will make the caller restart the whole btree search.
However we are causing the restart of btree search even for cases where
it is not necessary:
1) We have a path with ->skip_locking set to true, typically when doing
a search on a commit root, so we are never holding locks on any node;
2) We are doing a read search (the "ins_len" argument passed to
btrfs_search_slot() is 0), or we are doing a search to modify an
existing key (the "cow" argument passed to btrfs_search_slot() has
a value of 1 and "ins_len" is 0), in which case we never hold locks
for upper level nodes;
3) We are doing a search to insert or delete a key, in which case we may
or may not have upper level nodes locked. That depends on the current
minimum write lock levels at btrfs_search_slot(), if we had to split
or merge parent nodes, if we had to COW upper level nodes and if
we ever visited slot 0 of an upper level node. It's still common to
not have upper level nodes locked, but our current node must be at
least at level 1, for insertions, or at least at level 2 for deletions.
In these cases when we have locks on upper level nodes, they are always
write locks.
These cases where we are not holding locks on upper level nodes far
outweigh the cases where we are holding locks, so it's completely wasteful
to retry the whole search when we have no upper nodes locked.
So change the logic to not return -EAGAIN, and make the caller retry the
search, when we don't have the parent node locked - when it's not locked
it means no other upper level nodes are locked as well.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-11 14:35:31 +03:00
btrfs_unlock_up_safe ( p , level + 1 ) ;
ret = - EAGAIN ;
} else {
ret = 0 ;
}
2009-04-20 23:50:10 +04:00
2015-11-27 18:31:35 +03:00
if ( p - > reada ! = READA_NONE )
2016-06-23 01:54:24 +03:00
reada_for_search ( fs_info , p , level , slot , key - > objectid ) ;
2009-04-03 18:14:18 +04:00
2022-09-14 08:32:50 +03:00
tmp = read_tree_block ( fs_info , blocknr , & check ) ;
2022-02-22 10:41:19 +03:00
if ( IS_ERR ( tmp ) ) {
btrfs_release_path ( p ) ;
return PTR_ERR ( tmp ) ;
2009-05-14 21:24:30 +04:00
}
2022-02-22 10:41:19 +03:00
/*
* If the read above didn ' t mark this buffer up to date ,
* it will never end up being up to date . Set ret to EIO now
* and give up so that our caller doesn ' t loop forever
* on our EAGAINs .
*/
if ( ! extent_buffer_uptodate ( tmp ) )
ret = - EIO ;
btrfs: fix reading stale metadata blocks after degraded raid1 mounts
If a btree block, aka. extent buffer, is not available in the extent
buffer cache, it'll be read out from the disk instead, i.e.
btrfs_search_slot()
read_block_for_search() # hold parent and its lock, go to read child
btrfs_release_path()
read_tree_block() # read child
Unfortunately, the parent lock was released before reading the child, so
commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
used 0 as parent transid to read the child block. That forces
read_tree_block() not to check whether the parent transid differs from the
generation id of the child that it reads from disk.
A simple PoC is included in btrfs/124,
0. A two-disk raid1 btrfs,
1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
2. Mount this filesystem and put it in use, after a while, device tree's
root got COW but block A hasn't been allocated/overwritten yet.
3. Umount it and reload the btrfs module to remove both disks from the
global @fs_devices list.
4. mount -odegraded dev1 and write some data, so now block A is allocated
to be a leaf in checksum tree. Note that only dev1 has the latest
metadata of this filesystem.
5. Umount it and mount it again normally (with both disks), since raid1
can pick up one disk by the writer task's pid, if btrfs_search_slot()
needs to read block A, dev2 which does NOT have the latest metadata
might be read for block A, then we got a stale block A.
6. As parent transid is not checked, block A is marked as uptodate and
put into the extent buffer cache, so the future search won't bother
to read disk again, which means it'll make changes on this stale
one and make it dirty and flush it onto disk.
To avoid the problem, parent transid needs to be passed to
read_tree_block().
In order to get a valid parent transid, we need to hold the parent's
lock until finishing reading child.
This patch needs to be slightly adapted for stable kernels, the
&first_key parameter added to read_tree_block() is from 4.16+
(581c1760415c4). The fix is to replace 0 by 'gen'.
Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
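The effect of passing 0 as the parent transid, described in the changelog above, can be illustrated with a minimal userspace sketch. The function name `transid_check_sketch()` and the choice of -EIO for a mismatch are assumptions made for this illustration only; this is not the kernel's verification code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: compare the generation found in a block read from
 * disk against the transid recorded in its parent. An expected transid of
 * 0 means "do not check", which is the behaviour that let the stale
 * block A from the changelog be accepted as up to date.
 */
static int transid_check_sketch(uint64_t child_generation,
				uint64_t expected_transid)
{
	if (expected_transid == 0)
		return 0;	/* check skipped, stale data goes unnoticed */
	return child_generation == expected_transid ? 0 : -EIO;
}

/* Usage examples for the three interesting cases. */
static void transid_check_examples(void)
{
	/* Stale block (generation 7), no expectation passed: accepted. */
	assert(transid_check_sketch(7, 0) == 0);
	/* Same stale block, parent says generation 12: rejected. */
	assert(transid_check_sketch(7, 12) == -EIO);
	/* Up to date block: accepted. */
	assert(transid_check_sketch(12, 12) == 0);
}
```

The second case is the one the fix relies on: a valid expectation can only be supplied while the parent is still locked, which is why the parent lock has to be held until the child has been read.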
2018-05-15 20:37:36 +03:00
2022-03-11 14:35:32 +03:00
out :
2022-03-11 14:35:31 +03:00
if ( ret = = 0 ) {
* eb_ret = tmp ;
} else {
free_extent_buffer ( tmp ) ;
btrfs_release_path ( p ) ;
}
2009-05-14 21:24:30 +04:00
return ret ;
2009-04-03 18:14:18 +04:00
}
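The restart rule used when a node must be read from disk (drop the upper locks and return -EAGAIN only if the parent is locked) can be sketched in plain C as follows. This is a simplified illustration under stated assumptions: `struct lock_path_sketch`, `LOCK_SKETCH_MAX_LEVEL` and both helper names are hypothetical stand-ins, not the kernel's types or functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define LOCK_SKETCH_MAX_LEVEL 8	/* assumed tree height limit for the sketch */

/* Hypothetical stand-in for the lock bookkeeping of a search path. */
struct lock_path_sketch {
	int locks[LOCK_SKETCH_MAX_LEVEL];	/* 0 means "not locked" */
};

/* Drop every lock held at 'level' and above. */
static void unlock_up_sketch(struct lock_path_sketch *p, int level)
{
	for (int i = level; i < LOCK_SKETCH_MAX_LEVEL; i++)
		p->locks[i] = 0;
}

/*
 * Called before blocking on IO for the node at 'level'. Only the parent
 * is inspected: if it is not locked, no other upper level is locked
 * either, so the search can continue without a restart.
 */
static int prepare_blocking_read_sketch(struct lock_path_sketch *p, int level)
{
	bool unlock_up;

	assert(level >= 0 && level < LOCK_SKETCH_MAX_LEVEL);
	unlock_up = (level + 1 < LOCK_SKETCH_MAX_LEVEL) && p->locks[level + 1];
	if (!unlock_up)
		return 0;	/* nothing held above us, no restart needed */

	unlock_up_sketch(p, level + 1);
	return -EAGAIN;		/* caller must restart the whole search */
}
```

A caller that gets 0 back keeps walking down the tree; a caller that gets -EAGAIN has lost its upper locks and has to search again from the root.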
/*
* helper function for btrfs_search_slot . This does all of the checks
* for node - level blocks and does any balancing required based on
* the ins_len .
*
* If no extra work was required , zero is returned . If we had to
* drop the path , - EAGAIN is returned and btrfs_search_slot must
* start over
*/
static int
setup_nodes_for_search ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root , struct btrfs_path * p ,
2011-07-16 23:23:14 +04:00
struct extent_buffer * b , int level , int ins_len ,
int * write_lock_level )
2009-04-03 18:14:18 +04:00
{
2016-06-23 01:54:23 +03:00
struct btrfs_fs_info * fs_info = root - > fs_info ;
2020-11-13 10:29:40 +03:00
int ret = 0 ;
2016-06-23 01:54:23 +03:00
2009-04-03 18:14:18 +04:00
if ( ( p - > search_for_split | | ins_len > 0 ) & & btrfs_header_nritems ( b ) > =
2016-06-23 01:54:23 +03:00
BTRFS_NODEPTRS_PER_BLOCK ( fs_info ) - 3 ) {
2009-04-03 18:14:18 +04:00
2011-07-16 23:23:14 +04:00
if ( * write_lock_level < level + 1 ) {
* write_lock_level = level + 1 ;
btrfs_release_path ( p ) ;
2020-11-13 10:29:40 +03:00
return - EAGAIN ;
2011-07-16 23:23:14 +04:00
}
2020-11-05 18:45:09 +03:00
reada_for_balance ( p , level ) ;
2020-11-13 10:29:40 +03:00
ret = split_node ( trans , root , p , level ) ;
2009-04-03 18:14:18 +04:00
b = p - > nodes [ level ] ;
} else if ( ins_len < 0 & & btrfs_header_nritems ( b ) <
2016-06-23 01:54:23 +03:00
BTRFS_NODEPTRS_PER_BLOCK ( fs_info ) / 2 ) {
2009-04-03 18:14:18 +04:00
2011-07-16 23:23:14 +04:00
if ( * write_lock_level < level + 1 ) {
* write_lock_level = level + 1 ;
btrfs_release_path ( p ) ;
2020-11-13 10:29:40 +03:00
return - EAGAIN ;
2011-07-16 23:23:14 +04:00
}
2020-11-05 18:45:09 +03:00
reada_for_balance ( p , level ) ;
2020-11-13 10:29:40 +03:00
ret = balance_level ( trans , root , p , level ) ;
if ( ret )
return ret ;
2009-04-03 18:14:18 +04:00
b = p - > nodes [ level ] ;
if ( ! b ) {
2011-04-21 03:20:15 +04:00
btrfs_release_path ( p ) ;
2020-11-13 10:29:40 +03:00
return - EAGAIN ;
2009-04-03 18:14:18 +04:00
}
BUG_ON ( btrfs_header_nritems ( b ) = = 1 ) ;
}
return ret ;
}
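The two thresholds tested by setup_nodes_for_search() can be isolated into a small decision function. The sketch below is only an illustration: the enum, the function names and the node capacity of 100 used in the usage note are assumptions, and the real code additionally performs readahead and the actual split or balance.

```c
#include <assert.h>

enum node_action_sketch {
	NODE_SKETCH_NONE,
	NODE_SKETCH_SPLIT,	/* node is nearly full and we may insert */
	NODE_SKETCH_BALANCE,	/* node is under half full and we delete */
};

/* Decide what kind of rebalancing a node needs before descending. */
static enum node_action_sketch
node_action_sketch(unsigned int nritems, unsigned int ptrs_per_block,
		   int ins_len, int search_for_split)
{
	assert(ptrs_per_block > 3);	/* avoid unsigned underflow below */

	if ((search_for_split || ins_len > 0) &&
	    nritems >= ptrs_per_block - 3)
		return NODE_SKETCH_SPLIT;
	if (ins_len < 0 && nritems < ptrs_per_block / 2)
		return NODE_SKETCH_BALANCE;
	return NODE_SKETCH_NONE;
}

/*
 * Both branches first make sure the parent level is write locked. If it
 * is not, the required level is raised and the search must start over.
 * Returns 1 when a restart is needed, 0 otherwise.
 */
static int need_restart_sketch(int *write_lock_level, int level)
{
	if (*write_lock_level < level + 1) {
		*write_lock_level = level + 1;
		return 1;
	}
	return 0;
}
```

With an assumed capacity of 100 pointers, an insertion into a node holding 97 items asks for a split, while a deletion from a node holding 49 items asks for a balance.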
2015-01-02 20:45:16 +03:00
int btrfs_find_item ( struct btrfs_root * fs_root , struct btrfs_path * path ,
2013-11-05 07:33:33 +04:00
u64 iobjectid , u64 ioff , u8 key_type ,
struct btrfs_key * found_key )
{
int ret ;
struct btrfs_key key ;
struct extent_buffer * eb ;
2015-01-02 20:45:16 +03:00
ASSERT ( path ) ;
2015-01-02 21:36:14 +03:00
ASSERT ( found_key ) ;
2013-11-05 07:33:33 +04:00
key . type = key_type ;
key . objectid = iobjectid ;
key . offset = ioff ;
ret = btrfs_search_slot ( NULL , fs_root , & key , path , 0 , 0 ) ;
2015-01-02 21:36:14 +03:00
if ( ret < 0 )
2013-11-05 07:33:33 +04:00
return ret ;
eb = path - > nodes [ 0 ] ;
if ( ret & & path - > slots [ 0 ] > = btrfs_header_nritems ( eb ) ) {
ret = btrfs_next_leaf ( fs_root , path ) ;
if ( ret )
return ret ;
eb = path - > nodes [ 0 ] ;
}
btrfs_item_key_to_cpu ( eb , found_key , path - > slots [ 0 ] ) ;
if ( found_key - > type ! = key . type | |
found_key - > objectid ! = key . objectid )
return 1 ;
return 0 ;
}
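The return convention of btrfs_find_item() (0 when an item with the same objectid and type is found at or after the requested offset, 1 when it is not) can be reproduced over a sorted array. The sketch below is a userspace illustration: `struct key_sketch` and both functions are hypothetical, the array stands in for the btree, and error returns (negative values) are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct key_sketch {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Keys are ordered by objectid, then type, then offset. */
static int key_cmp_sketch(const struct key_sketch *a,
			  const struct key_sketch *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Locate the first key >= target. Returns 0 if that key shares objectid
 * and type with the target, 1 otherwise (including "ran off the end").
 */
static int find_item_sketch(const struct key_sketch *keys, size_t n,
			    const struct key_sketch *target,
			    struct key_sketch *found)
{
	size_t lo = 0, hi = n;

	assert(target && found);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (key_cmp_sketch(&keys[mid], target) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	if (lo == n)
		return 1;	/* no key at or after the target */

	*found = keys[lo];
	if (found->type != target->type ||
	    found->objectid != target->objectid)
		return 1;
	return 0;
}
```

Note that, as in the function above, the found key is filled in even when 1 is returned because the next key belongs to a different objectid or type.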
2018-05-18 06:00:21 +03:00
static struct extent_buffer * btrfs_search_slot_get_root ( struct btrfs_root * root ,
struct btrfs_path * p ,
int write_lock_level )
{
struct extent_buffer * b ;
2021-11-24 22:14:24 +03:00
int root_lock = 0 ;
2018-05-18 06:00:21 +03:00
int level = 0 ;
if ( p - > search_commit_root ) {
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (especially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove or update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have fewer path
releases and re-searches of the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 15:03:38 +03:00
b = root - > commit_root ;
atomic_inc ( & b - > refs ) ;
2018-12-11 13:19:45 +03:00
level = btrfs_header_level ( b ) ;
2018-05-29 16:27:06 +03:00
/*
* Ensure that all callers have set skip_locking when
* p - > search_commit_root = 1.
*/
ASSERT ( p - > skip_locking = = 1 ) ;
2018-05-18 06:00:21 +03:00
goto out ;
}
if ( p - > skip_locking ) {
b = btrfs_root_node ( root ) ;
level = btrfs_header_level ( b ) ;
goto out ;
}
2021-11-24 22:14:24 +03:00
/* We try very hard to do read locks on the root */
root_lock = BTRFS_READ_LOCK ;
2018-05-18 06:00:21 +03:00
/*
2018-05-18 06:00:23 +03:00
* If the level is set to maximum , we can skip trying to get the read
* lock .
2018-05-18 06:00:21 +03:00
*/
2018-05-18 06:00:23 +03:00
if ( write_lock_level < BTRFS_MAX_LEVEL ) {
/*
* We don ' t know the level of the root node until we actually
* have it read locked
*/
2022-09-12 22:27:42 +03:00
if ( p - > nowait ) {
b = btrfs_try_read_lock_root_node ( root ) ;
if ( IS_ERR ( b ) )
return b ;
} else {
b = btrfs_read_lock_root_node ( root ) ;
}
2018-05-18 06:00:23 +03:00
level = btrfs_header_level ( b ) ;
if ( level > write_lock_level )
goto out ;
/* Whoops, must trade for write lock */
btrfs_tree_read_unlock ( b ) ;
free_extent_buffer ( b ) ;
}
2018-05-18 06:00:21 +03:00
b = btrfs_lock_root_node ( root ) ;
root_lock = BTRFS_WRITE_LOCK ;
/* The level might have changed, check again */
level = btrfs_header_level ( b ) ;
out :
2021-11-24 22:14:24 +03:00
/*
* The root may have failed to write out at some point , and thus is no
* longer valid , return an error in this case .
*/
if ( ! extent_buffer_uptodate ( b ) ) {
if ( root_lock )
btrfs_tree_unlock_rw ( b , root_lock ) ;
free_extent_buffer ( b ) ;
return ERR_PTR ( - EIO ) ;
}
2018-05-18 06:00:21 +03:00
p - > nodes [ level ] = b ;
if ( ! p - > skip_locking )
p - > locks [ level ] = root_lock ;
/*
* Callers are responsible for dropping b ' s references .
*/
return b ;
}
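The lock mode that btrfs_search_slot_get_root() ends up with can be summarised as a pure decision function. This sketch leaves out the actual locking, the nowait path and the second level check after trading locks; the enum values, `ROOT_SKETCH_MAX_LEVEL` and the function name are assumptions for illustration.

```c
#include <assert.h>

#define ROOT_SKETCH_MAX_LEVEL 8	/* assumed tree height limit for the sketch */

enum root_lock_sketch {
	ROOT_SKETCH_NO_LOCK,
	ROOT_SKETCH_READ_LOCK,
	ROOT_SKETCH_WRITE_LOCK,
};

/*
 * Which lock is held on the root once the helper returns, given the
 * level the root turned out to have. The level is only known after the
 * root has been read locked, which is why the read lock is tried first.
 */
static enum root_lock_sketch root_lock_mode_sketch(int skip_locking,
						   int root_level,
						   int write_lock_level)
{
	assert(root_level >= 0 && root_level < ROOT_SKETCH_MAX_LEVEL);

	if (skip_locking)
		return ROOT_SKETCH_NO_LOCK;	/* e.g. commit root searches */

	/* Root sits above every level that needs a write lock. */
	if (write_lock_level < ROOT_SKETCH_MAX_LEVEL &&
	    root_level > write_lock_level)
		return ROOT_SKETCH_READ_LOCK;

	/* Otherwise the read lock is traded for a write lock. */
	return ROOT_SKETCH_WRITE_LOCK;
}
```

For a plain lookup on a tree whose root is at level 2 with a write lock level of 0, the root stays read locked; when the write lock level is set to the maximum, the read lock attempt is skipped entirely.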
2021-11-22 15:03:38 +03:00
/*
* Replace the extent buffer at the lowest level of the path with a cloned
* version . The purpose is to be able to use it safely , after releasing the
* commit root semaphore , even if relocation is happening in parallel , the
* transaction used for relocation is committed and the extent buffer is
* reallocated in the next transaction .
*
* This is used in a context where the caller does not prevent transaction
* commits from happening , either by holding a transaction handle or holding
* some lock , while it ' s doing searches through a commit root .
* At the moment it ' s only used for send operations .
*/
static int finish_need_commit_sem_search ( struct btrfs_path * path )
{
const int i = path - > lowest_level ;
const int slot = path - > slots [ i ] ;
struct extent_buffer * lowest = path - > nodes [ i ] ;
struct extent_buffer * clone ;
ASSERT ( path - > need_commit_sem ) ;
if ( ! lowest )
return 0 ;
lockdep_assert_held_read ( & lowest - > fs_info - > commit_root_sem ) ;
clone = btrfs_clone_extent_buffer ( lowest ) ;
if ( ! clone )
return - ENOMEM ;
btrfs_release_path ( path ) ;
path - > nodes [ i ] = clone ;
path - > slots [ i ] = slot ;
return 0 ;
}
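The clone-and-replace step performed by finish_need_commit_sem_search() can be mimicked in userspace with heap copies. In the sketch below, `struct buf_sketch`, `struct clone_path_sketch` and both functions are hypothetical; the loop that clears the path only stands in for btrfs_release_path(), and no locking is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define CLONE_SKETCH_MAX_LEVEL 8	/* assumption for this sketch */

struct buf_sketch {
	size_t len;
	unsigned char *data;
};

struct clone_path_sketch {
	struct buf_sketch *nodes[CLONE_SKETCH_MAX_LEVEL];
	int slots[CLONE_SKETCH_MAX_LEVEL];
	int lowest_level;
	int need_commit_sem;
};

/* Make a private copy that nobody else can modify or reallocate. */
static struct buf_sketch *buf_sketch_clone(const struct buf_sketch *src)
{
	struct buf_sketch *copy = malloc(sizeof(*copy));

	if (!copy)
		return NULL;
	copy->data = malloc(src->len ? src->len : 1);
	if (!copy->data) {
		free(copy);
		return NULL;
	}
	memcpy(copy->data, src->data, src->len);
	copy->len = src->len;
	return copy;
}

static int finish_search_sketch(struct clone_path_sketch *path)
{
	const int i = path->lowest_level;
	const int slot = path->slots[i];
	struct buf_sketch *lowest = path->nodes[i];
	struct buf_sketch *clone;

	assert(path->need_commit_sem);
	if (!lowest)
		return 0;

	clone = buf_sketch_clone(lowest);
	if (!clone)
		return -ENOMEM;

	/* Forget every shared buffer, then install the private clone. */
	for (int lvl = 0; lvl < CLONE_SKETCH_MAX_LEVEL; lvl++) {
		path->nodes[lvl] = NULL;
		path->slots[lvl] = 0;
	}
	path->nodes[i] = clone;
	path->slots[i] = slot;
	return 0;
}
```

After the call, later changes to the shared buffer no longer affect what the path sees, which is the property send relies on once the commit root semaphore has been released. In this sketch the caller owns the clone and must free it.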
2018-05-18 06:00:21 +03:00
btrfs: try to unlock parent nodes earlier when inserting a key
When inserting a new key, we release the write lock on the leaf's parent
only after doing the binary search on the leaf. This is because if the
key ends up at slot 0, we will have to update the key at slot 0 of the
parent node. The same reasoning applies to any other upper level nodes
when their slot is 0. We also need to keep the parent locked in case the
leaf does not have enough free space to insert the new key/item, because
in that case we will split the leaf and we will need to add a new key to
the parent due to a new leaf resulting from the split operation.
However if the leaf has enough space for the new key and the key does not
end up at slot 0 of the leaf we could release our write lock on the parent
before doing the binary search on the leaf to figure out the destination
slot. That leads to reducing the amount of time other tasks are blocked
waiting to lock the parent, therefore increasing parallelism when there
are other tasks that are trying to access other leaves accessible through
the same parent. This also applies to other upper nodes besides the
immediate parent, when their slot is 0, since we keep locks on them until
we figure out if the leaf slot is slot 0 or not.
In fact, having the key end up at slot 0 is rare. Typically it
only happens when the key is less than or equal to the smallest, the
"left most", key of the entire btree, during a split attempt when we try
to push to the right sibling leaf or when the caller just wants to update
the item of an existing key. It's also very common that a leaf has enough
space to insert a new key, since after a split we move about half of the
keys from one into the new leaf.
So unlock the parent, and any other upper level nodes, when during a key
insertion we notice the key is greater than the first key in the leaf and
the leaf has enough free space. After unlocking the upper level nodes, do
the binary search using a low boundary of slot 1 and not slot 0, to figure
out the slot where the key will be inserted (or where the key already is
in case it exists and the caller wants to modify its item data).
This extra comparison, with the first key, is cheap and the key is very
likely already in a cache line because it immediately follows the header
of the extent buffer and we have recently read the level field of the
header (which in fact is the last field of the header).
The following fs_mark test was run on a non-debug kernel (debian's default
kernel config), with a 12 cores intel CPU, and using a NVMe device:
$ cat run-fsmark.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-O no-holes -R free-space-tree"
FILES=100000
THREADS=$(nproc --all)
FILE_SIZE=0
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
OPTS="-S 0 -L 10 -n $FILES -s $FILE_SIZE -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
Before this change:
FSUse% Count Size Files/sec App Overhead
0 1200000 0 165273.6 5958381
0 2400000 0 190938.3 6284477
0 3600000 0 181429.1 6044059
0 4800000 0 173979.2 6223418
0 6000000 0 139288.0 6384560
0 7200000 0 163000.4 6520083
1 8400000 0 57799.2 5388544
1 9600000 0 66461.6 5552969
2 10800000 0 49593.5 5163675
2 12000000 0 57672.1 4889398
After this change:
FSUse% Count Size Files/sec App Overhead
0 1200000 0 167987.3 (+1.6%) 6272730
0 2400000 0 198563.9 (+4.0%) 6048847
0 3600000 0 197436.6 (+8.8%) 6163637
0 4800000 0 202880.7 (+16.6%) 6371771
1 6000000 0 167275.9 (+20.1%) 6556733
1 7200000 0 204051.2 (+25.2%) 6817091
1 8400000 0 69622.8 (+20.5%) 5525675
1 9600000 0 69384.5 (+4.4%) 5700723
1 10800000 0 61454.1 (+23.9%) 5363754
3 12000000 0 61908.7 (+7.3%) 5370196
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
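The idea from the changelog above (compare with the first key, unlock the upper levels early when possible, then binary search starting at slot 1) can be sketched on an array of integer keys. The function names, the `has_space` flag and the `unlocked_early` output are assumptions for this illustration; the real code works on extent buffers and struct btrfs_key.

```c
#include <assert.h>

/*
 * Binary search in keys[low..nritems). Returns 0 on an exact match and
 * 1 otherwise; *slot is the match or the insertion position.
 */
static int bin_search_sketch(const int *keys, int nritems, int low,
			     int key, int *slot)
{
	int high = nritems;

	assert(low >= 0 && low <= nritems);
	while (low < high) {
		int mid = low + (high - low) / 2;

		if (keys[mid] < key) {
			low = mid + 1;
		} else if (keys[mid] > key) {
			high = mid;
		} else {
			*slot = mid;
			return 0;
		}
	}
	*slot = low;
	return 1;
}

/*
 * Leaf search for an insertion. If the leaf has room and the key sorts
 * after the first key, the key cannot land at slot 0, so upper level
 * locks could be dropped before searching (reported via unlocked_early)
 * and the search can start at slot 1.
 */
static int leaf_search_sketch(const int *keys, int nritems, int key,
			      int has_space, int *unlocked_early, int *slot)
{
	int low = 0;

	*unlocked_early = 0;
	if (nritems > 0 && has_space && key > keys[0]) {
		*unlocked_early = 1;
		low = 1;
	}
	return bin_search_sketch(keys, nritems, low, key, slot);
}
```

For a leaf holding keys 10, 20 and 30 with free space, inserting 25 allows the early unlock and yields slot 2, while inserting 5 has to keep the upper locks because the key lands at slot 0.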
2021-12-02 13:30:36 +03:00
static inline int search_for_key_slot ( struct extent_buffer * eb ,
int search_low_slot ,
const struct btrfs_key * key ,
int prev_cmp ,
int * slot )
{
/*
* If a previous call to btrfs_bin_search ( ) on a parent node returned an
* exact match ( prev_cmp = = 0 ) , we can safely assume the target key will
* always be at slot 0 on lower levels , since each key pointer
* ( struct btrfs_key_ptr ) refers to the lowest key accessible from the
* subtree it points to . Thus we can skip searching lower levels .
*/
if ( prev_cmp = = 0 ) {
* slot = 0 ;
return 0 ;
}
2023-02-24 06:31:26 +03:00
return btrfs_bin_search ( eb , search_low_slot , key , slot ) ;
2021-12-02 13:30:36 +03:00
}
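The shortcut in search_for_key_slot ( ) and the low-slot bound passed to btrfs_bin_search ( ) can be illustrated with a small standalone sketch. This is not the kernel implementation: struct demo_key, demo_comp_keys ( ) , demo_bin_search ( ) and demo_search_for_key_slot ( ) are hypothetical stand-ins that work on a plain in-memory array instead of an extent buffer. Only the return convention (0 for an exact match, 1 with the insertion slot otherwise) follows the description in the source.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for struct btrfs_key. */
struct demo_key {
	unsigned long long objectid;
	unsigned char type;
	unsigned long long offset;
};

/* Three-way compare on (objectid, type, offset), in that order. */
static int demo_comp_keys(const struct demo_key *a, const struct demo_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Binary search over keys[first_slot .. nritems - 1].
 * Returns 0 and sets *slot to the matching index on an exact match,
 * otherwise returns 1 and sets *slot to the insertion position.
 */
static int demo_bin_search(const struct demo_key *keys, int nritems,
			   int first_slot, const struct demo_key *key,
			   int *slot)
{
	int low = first_slot;
	int high = nritems;

	assert(first_slot >= 0 && first_slot <= nritems);
	while (low < high) {
		int mid = low + (high - low) / 2;
		int cmp = demo_comp_keys(&keys[mid], key);

		if (cmp < 0) {
			low = mid + 1;
		} else if (cmp > 0) {
			high = mid;
		} else {
			*slot = mid;
			return 0;
		}
	}
	*slot = low;
	return 1;
}

/*
 * If the parent level already found an exact match (prev_cmp == 0),
 * the key must be at slot 0 of this level, so skip the search.
 */
static int demo_search_for_key_slot(const struct demo_key *keys, int nritems,
				    int search_low_slot,
				    const struct demo_key *key,
				    int prev_cmp, int *slot)
{
	if (prev_cmp == 0) {
		*slot = 0;
		return 0;
	}
	return demo_bin_search(keys, nritems, search_low_slot, key, slot);
}
```

Passing a low slot of 1 is only safe after the caller has already compared the target key with the key at slot 0 and found it to be greater, which is the check search_leaf ( ) performs before it unlocks the upper nodes.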
2021-12-02 13:30:38 +03:00
static int search_leaf ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
const struct btrfs_key * key ,
struct btrfs_path * path ,
int ins_len ,
int prev_cmp )
{
struct extent_buffer * leaf = path - > nodes [ 0 ] ;
int leaf_free_space = - 1 ;
int search_low_slot = 0 ;
int ret ;
bool do_bin_search = true ;
/*
* If we are doing an insertion , the leaf has enough free space and the
* destination slot for the key is not slot 0 , then we can unlock our
* write lock on the parent , and any other upper nodes , before doing the
* binary search on the leaf ( with search_for_key_slot ( ) ) , allowing other
* tasks to lock the parent and any other upper nodes .
*/
if ( ins_len > 0 ) {
/*
* Cache the leaf free space , since we will need it later and it
* will not change until then .
*/
leaf_free_space = btrfs_leaf_free_space ( leaf ) ;
/*
* ! path - > locks [ 1 ] means we have a single node tree , the leaf is
* the root of the tree .
*/
if ( path - > locks [ 1 ] & & leaf_free_space > = ins_len ) {
struct btrfs_disk_key first_key ;
ASSERT ( btrfs_header_nritems ( leaf ) > 0 ) ;
btrfs_item_key ( leaf , & first_key , 0 ) ;
/*
* Doing the extra comparison with the first key is cheap ,
* taking into account that the first key is very likely
* already in a cache line because it immediately follows
* the extent buffer ' s header and we have recently accessed
* the header ' s level field .
*/
ret = comp_keys ( & first_key , key ) ;
if ( ret < 0 ) {
/*
* The first key is smaller than the key we want
* to insert , so we are safe to unlock all upper
* nodes and we have to do the binary search .
*
* We do use btrfs_unlock_up_safe ( ) and not
* unlock_up ( ) because the latter does not unlock
* nodes with a slot of 0 - we can safely unlock
* any node even if its slot is 0 since in this
* case the key does not end up at slot 0 of the
* leaf and there ' s no need to split the leaf .
*/
btrfs_unlock_up_safe ( path , 1 ) ;
search_low_slot = 1 ;
} else {
/*
* The first key is > = than the key we want to
* insert , so we can skip the binary search as
* the target key will be at slot 0.
*
* We can not unlock upper nodes when the key is
* less than the first key , because we will need
* to update the key at slot 0 of the parent node
* and possibly of other upper nodes too .
* If the key matches the first key , then we can
* unlock all the upper nodes , using
* btrfs_unlock_up_safe ( ) instead of unlock_up ( )
* as stated above .
*/
if ( ret = = 0 )
btrfs_unlock_up_safe ( path , 1 ) ;
/*
* ret is already 0 or 1 , matching the result of
* a btrfs_bin_search ( ) call , so there is no need
* to adjust it .
*/
do_bin_search = false ;
path - > slots [ 0 ] = 0 ;
}
}
}
if ( do_bin_search ) {
ret = search_for_key_slot ( leaf , search_low_slot , key ,
prev_cmp , & path - > slots [ 0 ] ) ;
if ( ret < 0 )
return ret ;
}
if ( ins_len > 0 ) {
/*
* Item key already exists . In this case , if we are allowed to
* insert the item ( for example , in dir_item case , item key
* collision is allowed ) , it will be merged with the original
* item . Only the item size grows , no new btrfs item will be
* added . If search_for_extension is not set , ins_len already
* accounts for the size of struct btrfs_item , so deduct it here to
* keep the leaf space check correct .
*/
if ( ret = = 0 & & ! path - > search_for_extension ) {
ASSERT ( ins_len > = sizeof ( struct btrfs_item ) ) ;
ins_len - = sizeof ( struct btrfs_item ) ;
}
ASSERT ( leaf_free_space > = 0 ) ;
if ( leaf_free_space < ins_len ) {
int err ;
err = split_leaf ( trans , root , key , path , ins_len ,
( ret = = 0 ) ) ;
2021-12-02 13:30:39 +03:00
ASSERT ( err < = 0 ) ;
if ( WARN_ON ( err > 0 ) )
err = - EUCLEAN ;
2021-12-02 13:30:38 +03:00
if ( err )
ret = err ;
}
}
return ret ;
}
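The free space check at the end of search_leaf ( ) comes down to a small piece of arithmetic, sketched below as a standalone function. The names demo_leaf_needs_split ( ) and DEMO_ITEM_HEADER_SIZE are hypothetical. The value 25 is assumed here as the size of the packed on-disk struct btrfs_item (a 17 byte disk key plus two 32-bit fields); the kernel code uses sizeof ( struct btrfs_item ) directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed size of the on-disk item header, for illustration only. */
#define DEMO_ITEM_HEADER_SIZE 25

/*
 * Decide whether a leaf must be split before an insertion.
 *
 * leaf_free_space:      free bytes in the leaf
 * ins_len:              bytes requested by the caller
 * key_exists:           the search found an exact match (ret == 0)
 * search_for_extension: ins_len does not include the item header
 *
 * When the key already exists, the new data is merged into the existing
 * item, so no new item header is needed. If ins_len included a header,
 * deduct it before comparing against the free space.
 */
static bool demo_leaf_needs_split(int leaf_free_space, int ins_len,
				  bool key_exists, bool search_for_extension)
{
	assert(leaf_free_space >= 0);

	if (key_exists && !search_for_extension) {
		assert(ins_len >= DEMO_ITEM_HEADER_SIZE);
		ins_len -= DEMO_ITEM_HEADER_SIZE;
	}
	return leaf_free_space < ins_len;
}
```

With 100 free bytes and ins_len of 120, a new key forces a split, while a colliding key only needs 95 bytes and fits. This is the case that the search_for_extension change addresses.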
2007-02-02 19:05:29 +03:00
/*
2023-09-08 02:09:25 +03:00
* Look for a key in a tree and perform necessary modifications to preserve
* tree invariants .
2007-02-02 19:05:29 +03:00
*
2017-12-13 10:38:14 +03:00
* @ trans : Handle of transaction , used when modifying the tree
* @ p : Holds all btree nodes along the search path
* @ root : The root node of the tree
* @ key : The key we are looking for
btrfs: correctly calculate item size used when item key collision happens
Item key collision is allowed for some item types, like dir item and
inode refs, but the overall item size is limited by the nodesize.
The item size (ins_len) passed from btrfs_insert_empty_items to
btrfs_search_slot already contains the size of btrfs_item.
When btrfs_search_slot reaches leaf, we'll see if we need to split leaf.
The check incorrectly reports that split leaf is required, because
it treats the space required by the newly inserted item as
btrfs_item + item data. But in item key collision case, only item data
is actually needed, the newly inserted item could merge into the existing
one. No new btrfs_item will be inserted.
And split_leaf returns -EOVERFLOW from the following code:
if (extend && data_size + btrfs_item_size_nr(l, slot) +
sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info))
return -EOVERFLOW;
In most cases, when callers receive EOVERFLOW, they either return
this error or handle in different ways. For example, in normal dir item
creation the userspace will get errno EOVERFLOW; in inode ref case
INODE_EXTREF is used instead.
However, this is not the case for rename. To avoid the unrecoverable
situation in rename, btrfs_check_dir_item_collision is called in
early phase of rename. In this function, when item key collision is
detected leaf space is checked:
data_size = sizeof(*di) + name_len;
if (data_size + btrfs_item_size_nr(leaf, slot) +
sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info))
the sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) here
refers to existing item size, the condition here correctly calculates
the needed size for collision case rather than the wrong case above.
The consequence of inconsistent condition check between
btrfs_check_dir_item_collision and btrfs_search_slot when item key
collision happens is that we might pass the check here but fail
later at btrfs_search_slot. The rename fails and the volume is forced read-only:
[436149.586170] ------------[ cut here ]------------
[436149.586173] BTRFS: Transaction aborted (error -75)
[436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs]
[436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G D 4.18.0-rc5+ #1
[436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs]
[436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286
[436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006
[436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0
[436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472
[436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08
[436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe
[436149.586260] FS: 00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000
[436149.586261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0
[436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[436149.586296] Call Trace:
[436149.586311] vfs_rename+0x383/0x920
[436149.586313] ? vfs_rename+0x383/0x920
[436149.586315] do_renameat2+0x4ca/0x590
[436149.586317] __x64_sys_rename+0x20/0x30
[436149.586324] do_syscall_64+0x5a/0x120
[436149.586330] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[436149.586332] RIP: 0033:0x7fa3133b1d37
[436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37
[436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0
[436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0
[436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00
Thanks to Hans van Kranenburg for information about crc32 hash collision
tools, I was able to reproduce the dir item collision with the following
python script:
https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py
Running it on a btrfs volume will trigger the transaction abort. It simply
creates files and renames them to forged names that lead to a
hash collision.
There are two ways to fix this. One is to simply revert the patch
878f2d2cb355 ("Btrfs: fix max dir item size calculation") to make the
condition consistent although that patch is correct about the size.
The other way is to handle the leaf space check correctly when a
collision happens. I prefer the second one since it corrects the leaf
space check in the collision case. This fix does not account for
sizeof(struct btrfs_item) when the item already exists.
There are two places where ins_len doesn't contain
sizeof(struct btrfs_item), however.
1. extent-tree.c: lookup_inline_extent_backref
2. file-item.c: btrfs_csum_file_blocks
To make the logic of btrfs_search_slot clearer, we add a flag
search_for_extension in btrfs_path.
This flag indicates that ins_len passed to btrfs_search_slot doesn't
contain sizeof(struct btrfs_item). When key exists, btrfs_search_slot
will use the actual size needed to calculate the required leaf space.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-01 12:25:12 +03:00
* @ ins_len : Indicates purpose of search :
* > 0 for inserts it ' s size of item inserted ( * )
* < 0 for deletions
* 0 for plain searches , not modifying the tree
*
* ( * ) If size of item inserted doesn ' t include
* sizeof ( struct btrfs_item ) , then p - > search_for_extension must
* be set .
2017-12-13 10:38:14 +03:00
* @ cow : Whether CoW operations should be performed . Must always be 1
* when modifying the tree .
2007-02-24 21:39:08 +03:00
*
2017-12-13 10:38:14 +03:00
* If @ ins_len > 0 , nodes and leaves will be split as we walk down the tree .
* If @ ins_len < 0 , nodes will be merged as we walk down the tree ( if possible )
*
* If @ key is found , 0 is returned and you can find the item in the leaf level
* of the path ( level 0 )
*
* If @ key isn ' t found , 1 is returned and the leaf level of the path ( level 0 )
* points to the slot where it should be inserted
*
* If an error is encountered while searching the tree a negative error number
* is returned
2007-02-02 19:05:29 +03:00
*/
2017-01-18 10:24:37 +03:00
int btrfs_search_slot ( struct btrfs_trans_handle * trans , struct btrfs_root * root ,
const struct btrfs_key * key , struct btrfs_path * p ,
int ins_len , int cow )
2007-01-26 23:51:26 +03:00
{
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (especially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystems we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove or update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have fewer path
releases and less re-searching of the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 15:03:38 +03:00
struct btrfs_fs_info * fs_info = root - > fs_info ;
2007-10-16 00:14:19 +04:00
struct extent_buffer * b ;
2007-01-26 23:51:26 +03:00
int slot ;
int ret ;
2009-07-22 17:59:00 +04:00
int err ;
2007-01-26 23:51:26 +03:00
int level ;
2008-06-26 00:01:30 +04:00
int lowest_unlock = 1 ;
2011-07-16 23:23:14 +04:00
/* everything at write_lock_level or lower must be write locked */
int write_lock_level = 0 ;
2007-08-07 23:52:19 +04:00
u8 lowest_level = 0 ;
2012-03-19 23:54:38 +04:00
int min_write_lock_level ;
2013-08-30 18:46:43 +04:00
int prev_cmp ;
2007-08-07 23:52:19 +04:00
2022-11-16 17:23:53 +03:00
might_sleep ( ) ;
2007-08-08 00:15:09 +04:00
lowest_level = p - > lowest_level ;
2008-10-02 03:05:46 +04:00
WARN_ON ( lowest_level & & ins_len > 0 ) ;
2007-03-30 16:47:31 +04:00
WARN_ON ( p - > nodes [ 0 ] ! = NULL ) ;
2013-12-23 15:53:02 +04:00
BUG_ON ( ! cow & & ins_len ) ;
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space rebalancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-29 21:49:05 +03:00
2022-09-12 22:27:42 +03:00
/*
* For now only allow nowait for read only operations . There ' s no
* strict reason why we can ' t , we just only need it for reads so it ' s
* only implemented for reads .
*/
ASSERT ( ! p - > nowait | | ! cow ) ;
2011-07-16 23:23:14 +04:00
if ( ins_len < 0 ) {
2008-06-26 00:01:30 +04:00
lowest_unlock = 2 ;
2008-08-01 23:11:20 +04:00
2011-07-16 23:23:14 +04:00
/* when we are removing items, we might have to go up to level
* two as we update tree pointers. Make sure we keep write locks
* for those levels as well
*/
write_lock_level = 2 ;
} else if ( ins_len > 0 ) {
/*
* for inserting items , make sure we have a write lock on
* level 1 so we can update keys
*/
write_lock_level = 1 ;
}
if ( ! cow )
write_lock_level = - 1 ;
2013-04-06 00:51:15 +04:00
if ( cow & & ( p - > keep_locks | | p - > lowest_level ) )
2011-07-16 23:23:14 +04:00
write_lock_level = BTRFS_MAX_LEVEL ;
2012-03-19 23:54:38 +04:00
min_write_lock_level = write_lock_level ;
2021-11-22 15:03:38 +03:00
if ( p - > need_commit_sem ) {
ASSERT ( p - > search_commit_root ) ;
2022-09-12 22:27:42 +03:00
if ( p - > nowait ) {
if ( ! down_read_trylock ( & fs_info - > commit_root_sem ) )
return - EAGAIN ;
} else {
down_read ( & fs_info - > commit_root_sem ) ;
}
2021-11-22 15:03:38 +03:00
}
2007-03-01 20:04:21 +03:00
again:
2013-08-30 18:46:43 +04:00
	prev_cmp = -1;
2018-05-18 06:00:21 +03:00
	b = btrfs_search_slot_get_root(root, p, write_lock_level);
2018-12-11 13:19:45 +03:00
	if (IS_ERR(b)) {
		ret = PTR_ERR(b);
		goto done;
	}
2008-06-26 00:01:30 +04:00
2007-02-02 17:18:22 +03:00
	while (b) {
2019-09-10 10:40:17 +03:00
		int dec = 0;
2007-10-16 00:14:19 +04:00
		level = btrfs_header_level(b);
2008-08-01 23:11:20 +04:00
2007-03-03 00:08:05 +03:00
		if (cow) {
2017-12-12 12:14:49 +03:00
			bool last_level = (level == (BTRFS_MAX_LEVEL - 1));
2009-04-03 18:14:18 +04:00
			/*
			 * if we don't really need to cow this block
			 * then we don't want to set the path blocking,
			 * so we test it here
			 */
btrfs: always abort the transaction if we abort a trans handle
While stress testing our error handling I noticed that sometimes we
would still commit the transaction even though we had aborted the
transaction.
Currently we track if a trans handle has dirtied any metadata, and if it
hasn't we mark the filesystem as having an error (so no new transactions
can be started), but we will allow the current transaction to complete
as we do not mark the transaction itself as having been aborted.
This sounds good in theory, but we were not properly tracking IO errors
in btrfs_finish_ordered_io, and thus committing the transaction with
bogus free space data. This isn't necessarily a problem per-se with the
free space cache, as the other guards in place would have kept us from
accepting the free space cache as valid, but highlights a real world
case where we had a bug and could have corrupted the filesystem because
of it.
This "skip abort on empty trans handle" is nice in theory, but assumes
we have perfect error handling everywhere, which we clearly do not.
Also we do not allow further transactions to be started, so all this
does is save the last transaction that was happening, which doesn't
necessarily gain us anything other than the potential for real
corruption.
Remove this particular bit of code. If we decide we need to abort the
transaction, then abort the current one and keep us from doing real harm
to the file system, regardless of whether this specific trans handle
dirtied anything or not.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-20 18:21:31 +03:00
			if (!should_cow_block(trans, root, b))
2008-08-01 23:11:20 +04:00
				goto cow_done;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
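The two COW cases described above can be modelled with toy reference counts. This is a minimal sketch under stated assumptions, not the btrfs code: `struct blk`, its `freed` flag and `cow_block()` are hypothetical, and the caller supplies the storage for the new block.

```c
#include <assert.h>

#define MAX_PTRS 4

/* Hypothetical in-memory stand-in for a tree block or extent. */
struct blk {
	int refs;			/* reference count */
	int nr_ptrs;			/* number of valid entries in ptrs[] */
	struct blk *ptrs[MAX_PTRS];	/* extents this block points to */
	int freed;			/* set instead of really freeing, for inspection */
};

/* Copy-on-write of 'old' into caller-provided storage 'copy'. */
struct blk *cow_block(struct blk *old, struct blk *copy)
{
	assert(old->refs >= 1 && old->nr_ptrs <= MAX_PTRS);

	*copy = *old;
	copy->freed = 0;
	copy->refs = 1;

	if (old->refs == 1) {
		/* Non-shared: the new block inherits the old block's
		 * references, and the old block is freed at once. */
		old->refs = 0;
		old->freed = 1;
	} else {
		/* Shared: everything the new block points to gains one
		 * reference, and the old block loses one. */
		for (int i = 0; i < old->nr_ptrs; i++)
			old->ptrs[i]->refs++;
		old->refs--;
	}
	return copy;
}
```

In the non-shared case no lower level reference count is touched, which is the saving the commit message describes.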
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 18:45:14 +04:00
2011-07-16 23:23:14 +04:00
			/*
			 * must have write locks on this node and the
			 * parent
			 */
2012-11-07 22:44:13 +04:00
			if (level > write_lock_level ||
			    (level + 1 > write_lock_level &&
			    level + 1 < BTRFS_MAX_LEVEL &&
			    p->nodes[level + 1])) {
2011-07-16 23:23:14 +04:00
				write_lock_level = level + 1;
				btrfs_release_path(p);
				goto again;
			}
2017-12-12 12:14:49 +03:00
			if (last_level)
				err = btrfs_cow_block(trans, root, b, NULL, 0,
2020-08-20 18:46:03 +03:00
						      &b,
						      BTRFS_NESTING_COW);
2017-12-12 12:14:49 +03:00
			else
				err = btrfs_cow_block(trans, root, b,
						      p->nodes[level + 1],
2020-08-20 18:46:03 +03:00
						      p->slots[level + 1], &b,
						      BTRFS_NESTING_COW);
2009-07-22 17:59:00 +04:00
			if (err) {
				ret = err;
2008-08-01 23:11:20 +04:00
				goto done;
2007-06-22 22:16:25 +04:00
			}
2007-03-03 00:08:05 +03:00
		}
2008-08-01 23:11:20 +04:00
cow_done:
2007-02-02 17:18:22 +03:00
		p->nodes[level] = b;
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able to get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
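The scheme above can be modelled in userspace with C11 atomics. This is a simplified sketch and not the kernel code: `struct eb_lock` and the four functions are hypothetical stand-ins, and waiters poll with sched_yield() where the real code uses a wait queue plus an adaptive spin.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for an extent buffer lock. */
struct eb_lock {
	atomic_bool spin;	/* true while the spinlock is held */
	atomic_bool blocking;	/* models the EXTENT_BUFFER_BLOCKING bit */
};

static void spin_acquire(struct eb_lock *l)
{
	while (atomic_exchange(&l->spin, true))
		;	/* busy-wait until the previous value was false */
}

static void spin_release(struct eb_lock *l)
{
	atomic_store(&l->spin, false);
}

/* Returns with the spinlock held and the blocking bit clear. */
void tree_lock(struct eb_lock *l)
{
	for (;;) {
		spin_acquire(l);
		if (!atomic_load(&l->blocking))
			return;
		/* The owner may be sleeping: drop the spinlock and wait
		 * for the blocking bit to go away before retrying. */
		spin_release(l);
		while (atomic_load(&l->blocking))
			sched_yield();
	}
}

/* Owner only: stay the logical owner, but allow scheduling. */
void set_lock_blocking(struct eb_lock *l)
{
	atomic_store(&l->blocking, true);
	spin_release(l);
}

/* Owner only: return to spinning mode with the spinlock held again. */
void clear_lock_blocking(struct eb_lock *l)
{
	spin_acquire(l);
	atomic_store(&l->blocking, false);
}

/* Valid in both modes: picks the right action from the blocking bit. */
void tree_unlock(struct eb_lock *l)
{
	if (atomic_load(&l->blocking))
		atomic_store(&l->blocking, false);
	else
		spin_release(l);
}
```

A holder would call set_lock_blocking() right before an operation that can schedule and clear_lock_blocking() right after it, keeping the cheap spinning mode everywhere else.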
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 17:25:08 +03:00
		/*
		 * we have a lock on b and as long as we aren't changing
		 * the tree, there is no way for the items in b to change.
		 * It is safe to drop the lock on our parent before we
		 * go through the expensive btree search on b.
		 *
2013-12-23 15:53:02 +04:00
		 * If we're inserting or deleting (ins_len != 0), then we might
		 * be changing slot zero, which may require changing the parent.
		 * So, we can't drop the lock until after we know which slot
		 * we're operating on.
2009-02-04 17:25:08 +03:00
		 */
2013-12-23 15:53:02 +04:00
		if (!ins_len && !p->keep_locks) {
			int u = level + 1;
			if (u < BTRFS_MAX_LEVEL && p->locks[u]) {
				btrfs_tree_unlock_rw(p->nodes[u], p->locks[u]);
				p->locks[u] = 0;
			}
		}
2009-02-04 17:25:08 +03:00
btrfs: try to unlock parent nodes earlier when inserting a key
When inserting a new key, we release the write lock on the leaf's parent
only after doing the binary search on the leaf. This is because if the
key ends up at slot 0, we will have to update the key at slot 0 of the
parent node. The same reasoning applies to any other upper level nodes
when their slot is 0. We also need to keep the parent locked in case the
leaf does not have enough free space to insert the new key/item, because
in that case we will split the leaf and we will need to add a new key to
the parent due to a new leaf resulting from the split operation.
However if the leaf has enough space for the new key and the key does not
end up at slot 0 of the leaf we could release our write lock on the parent
before doing the binary search on the leaf to figure out the destination
slot. That leads to reducing the amount of time other tasks are blocked
waiting to lock the parent, therefore increasing parallelism when there
are other tasks that are trying to access other leaves accessible through
the same parent. This also applies to other upper nodes besides the
immediate parent, when their slot is 0, since we keep locks on them until
we figure out if the leaf slot is slot 0 or not.
In fact, having the key end up at slot 0 is rare. Typically it
only happens when the key is less than or equal to the smallest, the
"left most", key of the entire btree, during a split attempt when we try
to push to the right sibling leaf, or when the caller just wants to update
the item of an existing key. It's also very common that a leaf has enough
space to insert a new key, since after a split we move about half of the
keys from one leaf into the new leaf.
So unlock the parent, and any other upper level nodes, when during a key
insertion we notice the key is greater than the first key in the leaf and
the leaf has enough free space. After unlocking the upper level nodes, do
the binary search using a low boundary of slot 1 and not slot 0, to figure
out the slot where the key will be inserted (or where the key already is
in case it exists and the caller wants to modify its item data).
This extra comparison, with the first key, is cheap and the key is very
likely already in a cache line because it immediately follows the header
of the extent buffer and we have recently read the level field of the
header (which in fact is the last field of the header).
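The decision described above can be sketched as a small standalone C function. It is an illustration under stated assumptions rather than the kernel's search_leaf(): the leaf is a sorted array of integer keys, and `struct leaf`, `leaf_search_for_insert()` and the `parents_unlocked` flag are hypothetical names used only here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical leaf: sorted keys plus a free-space counter. */
struct leaf {
	const uint64_t *keys;
	size_t nr_keys;
	size_t free_space;
};

/* Binary search in keys[low..nr_keys). Returns true on an exact match;
 * *slot is the match position or the insertion position. */
static bool bin_search(const struct leaf *l, size_t low, uint64_t key,
		       size_t *slot)
{
	size_t high = l->nr_keys;

	while (low < high) {
		size_t mid = low + (high - low) / 2;

		if (l->keys[mid] < key) {
			low = mid + 1;
		} else if (l->keys[mid] > key) {
			high = mid;
		} else {
			*slot = mid;
			return true;
		}
	}
	*slot = low;
	return false;
}

/* Returns true if the key already exists. *parents_unlocked reports
 * whether the upper level locks could be dropped before the search. */
bool leaf_search_for_insert(const struct leaf *l, uint64_t key,
			    size_t item_size, size_t *slot,
			    bool *parents_unlocked)
{
	size_t low = 0;

	assert(l && slot && parents_unlocked);

	*parents_unlocked = false;
	if (l->nr_keys > 0 && key > l->keys[0] &&
	    l->free_space >= item_size) {
		/* Slot 0 stays untouched and no split is needed, so the
		 * real code would unlock the upper levels at this point. */
		*parents_unlocked = true;
		low = 1;
	}
	return bin_search(l, low, key, slot);
}
```

Starting the search at slot 1 is safe only because the key was already compared against the first key of the leaf.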
The following fs_mark test was run on a non-debug kernel (Debian's default
kernel config), with a 12-core Intel CPU, and using an NVMe device:
   $ cat run-fsmark.sh
   #!/bin/bash

   DEV=/dev/nvme0n1
   MNT=/mnt/nvme0n1
   MOUNT_OPTIONS="-o ssd"
   MKFS_OPTIONS="-O no-holes -R free-space-tree"
   FILES=100000
   THREADS=$(nproc --all)
   FILE_SIZE=0

   echo "performance" | \
       tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

   mkfs.btrfs -f $MKFS_OPTIONS $DEV
   mount $MOUNT_OPTIONS $DEV $MNT

   OPTS="-S 0 -L 10 -n $FILES -s $FILE_SIZE -t $THREADS -k"
   for ((i = 1; i <= $THREADS; i++)); do
       OPTS="$OPTS -d $MNT/d$i"
   done

   fs_mark $OPTS

   umount $MNT

Before this change:

   FSUse%        Count         Size    Files/sec     App Overhead
        0      1200000            0     165273.6          5958381
        0      2400000            0     190938.3          6284477
        0      3600000            0     181429.1          6044059
        0      4800000            0     173979.2          6223418
        0      6000000            0     139288.0          6384560
        0      7200000            0     163000.4          6520083
        1      8400000            0      57799.2          5388544
        1      9600000            0      66461.6          5552969
        2     10800000            0      49593.5          5163675
        2     12000000            0      57672.1          4889398

After this change:

   FSUse%        Count         Size    Files/sec                App Overhead
        0      1200000            0     167987.3  (+1.6%)            6272730
        0      2400000            0     198563.9  (+4.0%)            6048847
        0      3600000            0     197436.6  (+8.8%)            6163637
        0      4800000            0     202880.7  (+16.6%)           6371771
        1      6000000            0     167275.9  (+20.1%)           6556733
        1      7200000            0     204051.2  (+25.2%)           6817091
        1      8400000            0      69622.8  (+20.5%)           5525675
        1      9600000            0      69384.5  (+4.4%)            5700723
        1     10800000            0      61454.1  (+23.9%)           5363754
        3     12000000            0      61908.7  (+7.3%)            5370196

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-02 13:30:36 +03:00
		if (level == 0) {
2021-12-02 13:30:38 +03:00
			if (ins_len > 0)
2021-12-02 13:30:37 +03:00
				ASSERT(write_lock_level >= 1);
2011-07-16 23:23:14 +04:00
2021-12-02 13:30:38 +03:00
			ret = search_leaf(trans, root, key, p, ins_len, prev_cmp);
2008-12-10 17:10:46 +03:00
			if (!p->search_for_split)
2012-03-19 23:54:38 +04:00
				unlock_up(p, level, lowest_unlock,
2018-08-14 05:46:53 +03:00
					  min_write_lock_level, NULL);
2008-08-01 23:11:20 +04:00
			goto done;
2007-01-26 23:51:26 +03:00
		}
2021-12-02 13:30:36 +03:00
		ret = search_for_key_slot(b, 0, key, prev_cmp, &slot);
		if (ret < 0)
			goto done;
		prev_cmp = ret;
2019-09-10 10:40:17 +03:00
		if (ret && slot > 0) {
			dec = 1;
			slot--;
		}
		p->slots[level] = slot;
		err = setup_nodes_for_search(trans, root, p, b, level, ins_len,
					     &write_lock_level);
		if (err == -EAGAIN)
			goto again;
		if (err) {
			ret = err;
			goto done;
		}
		b = p->nodes[level];
		slot = p->slots[level];
		/*
		 * Slot 0 is special, if we change the key we have to update
		 * the parent pointer which means we must have a write lock on
		 * the parent
		 */
		if (slot == 0 && ins_len && write_lock_level < level + 1) {
			write_lock_level = level + 1;
			btrfs_release_path(p);
			goto again;
		}
		unlock_up(p, level, lowest_unlock, min_write_lock_level,
			  &write_lock_level);
		if (level == lowest_level) {
			if (dec)
				p->slots[level]++;
			goto done;
		}
		err = read_block_for_search(root, p, &b, level, slot, key);
		if (err == -EAGAIN)
			goto again;
		if (err) {
			ret = err;
			goto done;
		}
		if (!p->skip_locking) {
			level = btrfs_header_level(b);
btrfs: fix lockdep splat with reloc root extent buffers
We have been hitting the following lockdep splat with btrfs/187 recently
WARNING: possible circular locking dependency detected
5.19.0-rc8+ #775 Not tainted
------------------------------------------------------
btrfs/752500 is trying to acquire lock:
ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
but task is already holding lock:
ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_init_new_buffer+0x7d/0x2c0
btrfs_alloc_tree_block+0x120/0x3b0
__btrfs_cow_block+0x136/0x600
btrfs_cow_block+0x10b/0x230
btrfs_search_slot+0x53b/0xb70
btrfs_lookup_inode+0x2a/0xa0
__btrfs_update_delayed_inode+0x5f/0x280
btrfs_async_run_delayed_root+0x24c/0x290
btrfs_work_helper+0xf2/0x3e0
process_one_work+0x271/0x590
worker_thread+0x52/0x3b0
kthread+0xf0/0x120
ret_from_fork+0x1f/0x30
-> #1 (btrfs-tree-01){++++}-{3:3}:
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_search_slot+0x3c3/0xb70
do_relocation+0x10c/0x6b0
relocate_tree_blocks+0x317/0x6d0
relocate_block_group+0x1f1/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
__lock_acquire+0x1122/0x1e10
lock_acquire+0xc2/0x2d0
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_lock_root_node+0x31/0x50
btrfs_search_slot+0x1cb/0xb70
replace_path+0x541/0x9f0
merge_reloc_root+0x1d6/0x610
merge_reloc_roots+0xe2/0x260
relocate_block_group+0x2c8/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
other info that might help us debug this:
Chain exists of:
btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(btrfs-tree-01/1);
                               lock(btrfs-tree-01);
                               lock(btrfs-tree-01/1);
  lock(btrfs-treloc-02#2);

 *** DEADLOCK ***
7 locks held by btrfs/752500:
#0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
#1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
#2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
#3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
#4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
#5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
#6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
stack backtrace:
CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x56/0x73
check_noncircular+0xd6/0x100
? lock_is_held_type+0xe2/0x140
__lock_acquire+0x1122/0x1e10
lock_acquire+0xc2/0x2d0
? __btrfs_tree_lock+0x24/0x110
down_write_nested+0x41/0x80
? __btrfs_tree_lock+0x24/0x110
__btrfs_tree_lock+0x24/0x110
btrfs_lock_root_node+0x31/0x50
btrfs_search_slot+0x1cb/0xb70
? lock_release+0x137/0x2d0
? _raw_spin_unlock+0x29/0x50
? release_extent_buffer+0x128/0x180
replace_path+0x541/0x9f0
merge_reloc_root+0x1d6/0x610
merge_reloc_roots+0xe2/0x260
relocate_block_group+0x2c8/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
? lock_is_held_type+0xe2/0x140
? lock_is_held_type+0xe2/0x140
? __x64_sys_ioctl+0x88/0xc0
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
This isn't necessarily new, it's just tricky to hit in practice. There
are two competing things going on here. With relocation we create a
snapshot of every fs tree with a reloc tree. Any extent buffers that
get initialized here are initialized with the reloc root lockdep key.
However since it is a snapshot, any blocks that are currently in cache
that originally belonged to the fs tree will have the normal tree
lockdep key set. This creates the lock dependency of
reloc tree -> normal tree
for the extent buffer locking during the first phase of the relocation
as we walk down the reloc root to relocate blocks.
However this is problematic because the final phase of the relocation is
merging the reloc root into the original fs root. This involves
searching down to any keys that exist in the original fs root and then
swapping the relocated block and the original fs root block. We have to
search down to the fs root first, and then go search the reloc root for
the block we need to replace. This creates the dependency of
normal tree -> reloc tree
which is why lockdep complains.
Additionally even if we were to fix this particular mismatch with a
different nesting for the merge case, we're still slotting in a block
that has an owner of the reloc root objectid into a normal tree, so that
block will have its lockdep key set to the tree reloc root, and create a
lockdep splat later on when we wander into that block from the fs root.
Unfortunately the only solution here is to make sure we do not set the
lockdep key to the reloc tree lockdep key normally, and then reset any
blocks we wander into from the reloc root when we're doing the merge.
This solves the problem of having mixed tree reloc keys intermixed with
normal tree keys, and then allows us to make sure in the merge case we
maintain the lock order of
normal tree -> reloc tree
We handle this by setting a bit on the reloc root when we do the search
for the block we want to relocate, and any block we search into or COW
at that point gets set to the reloc tree key. This works correctly
because we only ever COW down to the parent node, so we aren't resetting
the key for the block we're linking into the fs root.
With this patch we no longer have the lockdep splat in btrfs/187.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-26 23:24:04 +03:00
			btrfs_maybe_reset_lockdep_class(root, b);
2019-09-10 10:40:17 +03:00
			if (level <= write_lock_level) {
2020-08-20 18:46:10 +03:00
				btrfs_tree_lock(b);
2019-09-10 10:40:17 +03:00
				p->locks[level] = BTRFS_WRITE_LOCK;
			} else {
2022-09-12 22:27:42 +03:00
				if (p->nowait) {
					if (!btrfs_try_tree_read_lock(b)) {
						free_extent_buffer(b);
						ret = -EAGAIN;
						goto done;
					}
				} else {
					btrfs_tree_read_lock(b);
				}
2019-09-10 10:40:17 +03:00
				p->locks[level] = BTRFS_READ_LOCK;
			}
			p->nodes[level] = b;
		}
2007-01-26 23:51:26 +03:00
	}
2008-08-01 23:11:20 +04:00
	ret = 1;
done:
2014-11-09 11:38:39 +03:00
	if (ret < 0 && !p->skip_release_on_error)
2011-04-21 03:20:15 +04:00
		btrfs_release_path(p);
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (especially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystems we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved in the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove or update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
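Step 4 above can be sketched in userspace as follows. This is a minimal illustration and not the kernel code: `struct fs_state`, `struct send_state` and both function names are hypothetical stand-ins. The only idea taken from the text is that send compares a relocation-commit marker before advancing to the next key, and releases the path and searches again when the marker has changed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for filesystem-wide state (not a kernel struct). */
struct fs_state {
	uint64_t last_reloc_trans;	/* id of the last committed relocation transaction */
};

/* Hypothetical stand-in for per-send state (not a kernel struct). */
struct send_state {
	uint64_t seen_reloc_trans;	/* marker observed when the current search started */
	int restarts;			/* how many times the search path was rebuilt */
};

/* True if a relocation transaction was committed since the last search. */
static bool relocation_committed(const struct fs_state *fs,
				 const struct send_state *s)
{
	return fs->last_reloc_trans > s->seen_reloc_trans;
}

/*
 * Called before moving to the next key. If relocation committed, the cached
 * path may reference stale extent buffers, so remember the new marker and
 * tell the caller to release the path and search again for the current key.
 */
static bool must_restart_search(const struct fs_state *fs,
				struct send_state *s)
{
	if (!relocation_committed(fs, s))
		return false;
	s->seen_reloc_trans = fs->last_reloc_trans;
	s->restarts++;
	return true;
}
```

Restarting at the same key is safe here only because, as the text notes, the send and parent roots are read-only, so the key being searched for cannot disappear or move.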
This leaves some room for optimizations so that send does fewer path
releases and less re-searching of the trees when relocation is running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 15:03:38 +03:00
if ( p - > need_commit_sem ) {
int ret2 ;
ret2 = finish_need_commit_sem_search ( p ) ;
up_read ( & fs_info - > commit_root_sem ) ;
if ( ret2 )
ret = ret2 ;
}
2008-08-01 23:11:20 +04:00
return ret ;
2007-01-26 23:51:26 +03:00
}
2020-12-16 19:18:43 +03:00
ALLOW_ERROR_INJECTION ( btrfs_search_slot , ERRNO ) ;
2007-01-26 23:51:26 +03:00
2012-05-16 20:25:47 +04:00
/*
* Like btrfs_search_slot , this looks for a key in the given tree . It uses the
* current state of the tree together with the operations recorded in the tree
* modification log to search for the key in a previous version of this tree , as
* denoted by the time_seq parameter .
*
* Naturally , there is no support for insert , delete or cow operations .
*
* The resulting path and return value will be set up as if we called
* btrfs_search_slot at that point in time with ins_len and cow both set to 0.
*/
2017-01-18 10:24:37 +03:00
int btrfs_search_old_slot ( struct btrfs_root * root , const struct btrfs_key * key ,
2012-05-16 20:25:47 +04:00
struct btrfs_path * p , u64 time_seq )
{
2016-06-23 01:54:23 +03:00
struct btrfs_fs_info * fs_info = root - > fs_info ;
2012-05-16 20:25:47 +04:00
struct extent_buffer * b ;
int slot ;
int ret ;
int err ;
int level ;
int lowest_unlock = 1 ;
u8 lowest_level = 0 ;
lowest_level = p - > lowest_level ;
WARN_ON ( p - > nodes [ 0 ] ! = NULL ) ;
2022-09-12 22:27:51 +03:00
ASSERT ( ! p - > nowait ) ;
2012-05-16 20:25:47 +04:00
if ( p - > search_commit_root ) {
BUG_ON ( time_seq ) ;
return btrfs_search_slot ( NULL , root , key , p , 0 , 0 ) ;
}
again :
2021-03-11 17:31:07 +03:00
b = btrfs_get_old_root ( root , time_seq ) ;
2018-09-13 11:35:10 +03:00
if ( ! b ) {
ret = - EIO ;
goto done ;
}
2012-05-16 20:25:47 +04:00
level = btrfs_header_level ( b ) ;
p - > locks [ level ] = BTRFS_READ_LOCK ;
while ( b ) {
2019-09-10 10:40:18 +03:00
int dec = 0 ;
2012-05-16 20:25:47 +04:00
level = btrfs_header_level ( b ) ;
p - > nodes [ level ] = b ;
/*
* we have a lock on b and as long as we aren ' t changing
* the tree , there is no way for the items in b to change .
* It is safe to drop the lock on our parent before we
* go through the expensive btree search on b .
*/
btrfs_unlock_up_safe ( p , level + 1 ) ;
2023-02-24 06:31:26 +03:00
ret = btrfs_bin_search ( b , 0 , key , & slot ) ;
2019-02-18 19:57:26 +03:00
if ( ret < 0 )
goto done ;
2012-05-16 20:25:47 +04:00
2019-09-10 10:40:18 +03:00
if ( level = = 0 ) {
2012-05-16 20:25:47 +04:00
p - > slots [ level ] = slot ;
unlock_up ( p , level , lowest_unlock , 0 , NULL ) ;
2019-09-10 10:40:18 +03:00
goto done ;
}
2012-05-16 20:25:47 +04:00
2019-09-10 10:40:18 +03:00
if ( ret & & slot > 0 ) {
dec = 1 ;
slot - - ;
}
p - > slots [ level ] = slot ;
unlock_up ( p , level , lowest_unlock , 0 , NULL ) ;
2012-05-16 20:25:47 +04:00
2019-09-10 10:40:18 +03:00
if ( level = = lowest_level ) {
if ( dec )
p - > slots [ level ] + + ;
goto done ;
}
2012-05-16 20:25:47 +04:00
2019-09-10 10:40:18 +03:00
err = read_block_for_search ( root , p , & b , level , slot , key ) ;
if ( err = = - EAGAIN )
goto again ;
if ( err ) {
ret = err ;
2012-05-16 20:25:47 +04:00
goto done ;
}
2019-09-10 10:40:18 +03:00
level = btrfs_header_level ( b ) ;
2020-08-20 18:46:10 +03:00
btrfs_tree_read_lock ( b ) ;
2021-03-11 17:31:07 +03:00
b = btrfs_tree_mod_log_rewind ( fs_info , p , b , time_seq ) ;
2019-09-10 10:40:18 +03:00
if ( ! b ) {
ret = - ENOMEM ;
goto done ;
}
p - > locks [ level ] = BTRFS_READ_LOCK ;
p - > nodes [ level ] = b ;
2012-05-16 20:25:47 +04:00
}
ret = 1 ;
done :
if ( ret < 0 )
btrfs_release_path ( p ) ;
return ret ;
}
2023-04-12 13:33:10 +03:00
/*
* Search the tree again to find a leaf with smaller keys .
* Returns 0 if it found something .
* Returns 1 if there are no smaller keys .
* Returns < 0 on error .
*
* This may release the path , and so you may lose any locks held at the
* time you call it .
*/
static int btrfs_prev_leaf ( struct btrfs_root * root , struct btrfs_path * path )
{
struct btrfs_key key ;
struct btrfs_key orig_key ;
struct btrfs_disk_key found_key ;
int ret ;
btrfs_item_key_to_cpu ( path - > nodes [ 0 ] , & key , 0 ) ;
orig_key = key ;
if ( key . offset > 0 ) {
key . offset - - ;
} else if ( key . type > 0 ) {
key . type - - ;
key . offset = ( u64 ) - 1 ;
} else if ( key . objectid > 0 ) {
key . objectid - - ;
key . type = ( u8 ) - 1 ;
key . offset = ( u64 ) - 1 ;
} else {
return 1 ;
}
btrfs_release_path ( path ) ;
ret = btrfs_search_slot ( NULL , root , & key , path , 0 , 0 ) ;
if ( ret < = 0 )
return ret ;
/*
* Previous key not found . Even if we were at slot 0 of the leaf we had
* before releasing the path and calling btrfs_search_slot ( ) , we now may
* be in a slot pointing to the same original key - this can happen if
* after we released the path , one or more items were moved from a
* sibling leaf into the front of the leaf we had due to an insertion
* ( see push_leaf_right ( ) ) .
* If we hit this case and our slot is > 0 , just decrement the slot
* so that the caller does not process the same key again , which may or
* may not break the caller , depending on its logic .
*/
if ( path - > slots [ 0 ] < btrfs_header_nritems ( path - > nodes [ 0 ] ) ) {
btrfs_item_key ( path - > nodes [ 0 ] , & found_key , path - > slots [ 0 ] ) ;
ret = comp_keys ( & found_key , & orig_key ) ;
if ( ret = = 0 ) {
if ( path - > slots [ 0 ] > 0 ) {
path - > slots [ 0 ] - - ;
return 0 ;
}
/*
* At slot 0 , same key as before , it means orig_key is
* the lowest , leftmost , key in the tree . We ' re done .
*/
return 1 ;
}
}
btrfs_item_key ( path - > nodes [ 0 ] , & found_key , 0 ) ;
ret = comp_keys ( & found_key , & key ) ;
/*
* We might have had an item with the previous key in the tree right
* before we released our path . And after we released our path , that
* item might have been pushed to the first slot ( 0 ) of the leaf we
* were holding due to a tree balance . Alternatively , an item with the
* previous key can exist as the only element of a leaf ( big fat item ) .
* Therefore account for these 2 cases , so that our callers ( like
* btrfs_previous_item ) don ' t miss an existing item with a key matching
* the previous key we computed above .
*/
if ( ret < = 0 )
return 0 ;
return 1 ;
}
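The key arithmetic at the top of btrfs_prev_leaf() can be tried in isolation. The sketch below is a userspace illustration under stated assumptions: `struct key` and `key_to_prev()` are hypothetical names, and the struct only mirrors the (objectid, type, offset) ordering used above, not the kernel's struct btrfs_key layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace key, ordered by objectid, then type, then offset. */
struct key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/*
 * Step the key back to its immediate predecessor in key order.
 * Returns false if the key is already the smallest possible key (0, 0, 0).
 */
static bool key_to_prev(struct key *k)
{
	if (k->offset > 0) {
		k->offset--;
	} else if (k->type > 0) {
		/* Borrow from type: offset wraps to its maximum. */
		k->type--;
		k->offset = UINT64_MAX;
	} else if (k->objectid > 0) {
		/* Borrow from objectid: type and offset wrap to their maximums. */
		k->objectid--;
		k->type = UINT8_MAX;
		k->offset = UINT64_MAX;
	} else {
		return false;
	}
	return true;
}
```

For example, starting from (256, 1, 0) the predecessor is (256, 0, UINT64_MAX), which matches the `( u64 ) - 1` assignment in the function above.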
2011-09-13 13:18:10 +04:00
/*
* helper to use instead of search slot if no exact match is needed but
* instead the next or previous item should be returned .
* When find_higher is true , the next higher item is returned , the next lower
* otherwise .
* When return_any and find_higher are both true , and no higher item is found ,
* return the next lower instead .
* When return_any is true and find_higher is false , and no lower item is found ,
* return the next higher instead .
* It returns 0 if any item is found , 1 if none is found ( tree empty ) , and
* < 0 on error
*/
int btrfs_search_slot_for_read ( struct btrfs_root * root ,
2017-01-18 10:24:37 +03:00
const struct btrfs_key * key ,
struct btrfs_path * p , int find_higher ,
int return_any )
2011-09-13 13:18:10 +04:00
{
int ret ;
struct extent_buffer * leaf ;
again :
ret = btrfs_search_slot ( NULL , root , key , p , 0 , 0 ) ;
if ( ret < = 0 )
return ret ;
/*
* a return value of 1 means the path is at the position where the
* item should be inserted . Normally this is the next bigger item ,
* but in case the previous item is the last in a leaf , path points
* to the first free slot in the previous leaf , i . e . at an invalid
* item .
*/
leaf = p - > nodes [ 0 ] ;
if ( find_higher ) {
if ( p - > slots [ 0 ] > = btrfs_header_nritems ( leaf ) ) {
ret = btrfs_next_leaf ( root , p ) ;
if ( ret < = 0 )
return ret ;
if ( ! return_any )
return 1 ;
/*
* no higher item found , return the next
* lower instead
*/
return_any = 0 ;
find_higher = 0 ;
btrfs_release_path ( p ) ;
goto again ;
}
} else {
2011-09-13 13:18:10 +04:00
if ( p - > slots [ 0 ] = = 0 ) {
ret = btrfs_prev_leaf ( root , p ) ;
if ( ret < 0 )
return ret ;
if ( ! ret ) {
2014-01-12 01:28:54 +04:00
leaf = p - > nodes [ 0 ] ;
if ( p - > slots [ 0 ] = = btrfs_header_nritems ( leaf ) )
p - > slots [ 0 ] - - ;
2011-09-13 13:18:10 +04:00
return 0 ;
2011-09-13 13:18:10 +04:00
}
2011-09-13 13:18:10 +04:00
if ( ! return_any )
return 1 ;
/*
* no lower item found , return the next
* higher instead
*/
return_any = 0 ;
find_higher = 1 ;
btrfs_release_path ( p ) ;
goto again ;
} else {
2011-09-13 13:18:10 +04:00
- - p - > slots [ 0 ] ;
}
}
return 0 ;
}
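The find_higher / return_any semantics documented for btrfs_search_slot_for_read() can be modelled on a sorted array. This is only a sketch: `lower_bound()` and `search_for_read()` are hypothetical names, and a flat array stands in for the B-tree, so none of the leaf boundary handling (btrfs_next_leaf(), btrfs_prev_leaf()) appears here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the first element >= key, or n if there is none. */
static size_t lower_bound(const uint64_t *a, size_t n, uint64_t key)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (a[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Returns 0 and stores the chosen index in *idx if an item is found,
 * 1 if none is found. An exact match is always returned as is.
 */
static int search_for_read(const uint64_t *a, size_t n, uint64_t key,
			   int find_higher, int return_any, size_t *idx)
{
	size_t pos = lower_bound(a, n, key);

	if (pos < n && a[pos] == key) {
		*idx = pos;
		return 0;
	}
	if (find_higher) {
		if (pos < n) {
			*idx = pos;		/* next higher item */
			return 0;
		}
		if (!return_any || n == 0)
			return 1;
		*idx = n - 1;			/* fall back to next lower */
		return 0;
	}
	if (pos > 0) {
		*idx = pos - 1;			/* next lower item */
		return 0;
	}
	if (!return_any || n == 0)
		return 1;
	*idx = 0;				/* fall back to next higher */
	return 0;
}
```

With the array {10, 20, 30}, searching for 25 yields index 2 when find_higher is set and index 1 otherwise; searching for 35 with find_higher set finds nothing unless return_any is also set, in which case index 2 is returned.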
2021-07-29 11:22:16 +03:00
/*
* Execute search and call btrfs_previous_item to traverse backwards if the item
* was not found .
*
* Return 0 if found , 1 if not found and < 0 if error .
*/
int btrfs_search_backwards ( struct btrfs_root * root , struct btrfs_key * key ,
struct btrfs_path * path )
{
int ret ;
ret = btrfs_search_slot ( NULL , root , key , path , 0 , 0 ) ;
if ( ret > 0 )
ret = btrfs_previous_item ( root , path , key - > objectid , key - > type ) ;
if ( ret = = 0 )
btrfs_item_key_to_cpu ( path - > nodes [ 0 ] , key , path - > slots [ 0 ] ) ;
return ret ;
}
2022-10-27 15:21:42 +03:00
/*
2022-03-09 16:50:38 +03:00
* Search for a valid slot for the given path .
*
* @ root : The root node of the tree .
* @ key : Will contain a valid item if found .
* @ path : The starting point to validate the slot .
*
* Return : 0 if the item is valid
* 1 if not found
* < 0 if error .
*/
int btrfs_get_next_valid_item ( struct btrfs_root * root , struct btrfs_key * key ,
struct btrfs_path * path )
{
2023-04-05 20:52:23 +03:00
if ( path - > slots [ 0 ] > = btrfs_header_nritems ( path - > nodes [ 0 ] ) ) {
2022-03-09 16:50:38 +03:00
int ret ;
2023-04-05 20:52:23 +03:00
ret = btrfs_next_leaf ( root , path ) ;
if ( ret )
return ret ;
2022-03-09 16:50:38 +03:00
}
2023-04-05 20:52:23 +03:00
btrfs_item_key_to_cpu ( path - > nodes [ 0 ] , key , path - > slots [ 0 ] ) ;
2022-03-09 16:50:38 +03:00
return 0 ;
}
2007-02-02 19:05:29 +03:00
/*
* adjust the pointers going up the tree , starting at level
* making sure the right key of each node points to ' key ' .
* This is used after shifting pointers to the left , so it stops
* fixing up pointers when a given leaf / node is not in slot 0 of the
* higher levels
2007-03-01 00:35:06 +03:00
*
2007-02-02 19:05:29 +03:00
*/
2018-06-20 15:48:47 +03:00
static void fixup_low_keys ( struct btrfs_path * path ,
2012-03-01 17:56:26 +04:00
struct btrfs_disk_key * key , int level )
2007-01-26 23:51:26 +03:00
{
int i ;
2007-10-16 00:14:19 +04:00
struct extent_buffer * t ;
2018-03-05 18:16:54 +03:00
int ret ;
2007-10-16 00:14:19 +04:00
2007-03-13 17:46:10 +03:00
for ( i = level ; i < BTRFS_MAX_LEVEL ; i + + ) {
2007-01-26 23:51:26 +03:00
int tslot = path - > slots [ i ] ;
2018-03-05 18:16:54 +03:00
2007-02-02 17:18:22 +03:00
if ( ! path - > nodes [ i ] )
2007-01-26 23:51:26 +03:00
break ;
2007-10-16 00:14:19 +04:00
t = path - > nodes [ i ] ;
2021-03-11 17:31:07 +03:00
ret = btrfs_tree_mod_log_insert_key ( t , tslot ,
2022-10-14 16:44:33 +03:00
BTRFS_MOD_LOG_KEY_REPLACE ) ;
2018-03-05 18:16:54 +03:00
BUG_ON ( ret < 0 ) ;
2007-10-16 00:14:19 +04:00
btrfs_set_node_key ( t , key , tslot ) ;
2007-03-30 22:27:56 +04:00
btrfs_mark_buffer_dirty ( path - > nodes [ i ] ) ;
2007-01-26 23:51:26 +03:00
if ( tslot ! = 0 )
break ;
}
}
2008-09-23 21:14:14 +04:00
/*
* update item key .
*
* This function isn ' t completely safe . It ' s the caller ' s responsibility
* to ensure that the new key won ' t break the key order .
*/
2014-11-12 07:43:09 +03:00
void btrfs_set_item_key_safe ( struct btrfs_fs_info * fs_info ,
struct btrfs_path * path ,
2017-01-18 10:24:37 +03:00
const struct btrfs_key * new_key )
2008-09-23 21:14:14 +04:00
{
struct btrfs_disk_key disk_key ;
struct extent_buffer * eb ;
int slot ;
eb = path - > nodes [ 0 ] ;
slot = path - > slots [ 0 ] ;
if ( slot > 0 ) {
btrfs_item_key ( eb , & disk_key , slot - 1 ) ;
2019-04-25 03:55:53 +03:00
if ( unlikely ( comp_keys ( & disk_key , new_key ) > = 0 ) ) {
2023-04-27 15:16:28 +03:00
btrfs_print_leaf ( eb ) ;
2019-04-25 03:55:53 +03:00
btrfs_crit ( fs_info ,
" slot %u key (%llu %u %llu) new key (%llu %u %llu) " ,
slot , btrfs_disk_key_objectid ( & disk_key ) ,
btrfs_disk_key_type ( & disk_key ) ,
btrfs_disk_key_offset ( & disk_key ) ,
new_key - > objectid , new_key - > type ,
new_key - > offset ) ;
BUG ( ) ;
}
2008-09-23 21:14:14 +04:00
}
if ( slot < btrfs_header_nritems ( eb ) - 1 ) {
btrfs_item_key ( eb , & disk_key , slot + 1 ) ;
2019-04-25 03:55:53 +03:00
if ( unlikely ( comp_keys ( & disk_key , new_key ) < = 0 ) ) {
2023-04-27 15:16:28 +03:00
btrfs_print_leaf ( eb ) ;
2019-04-25 03:55:53 +03:00
btrfs_crit ( fs_info ,
" slot %u key (%llu %u %llu) new key (%llu %u %llu) " ,
slot , btrfs_disk_key_objectid ( & disk_key ) ,
btrfs_disk_key_type ( & disk_key ) ,
btrfs_disk_key_offset ( & disk_key ) ,
new_key - > objectid , new_key - > type ,
new_key - > offset ) ;
BUG ( ) ;
}
2008-09-23 21:14:14 +04:00
}
btrfs_cpu_key_to_disk ( & disk_key , new_key ) ;
btrfs_set_item_key ( eb , & disk_key , slot ) ;
btrfs_mark_buffer_dirty ( eb ) ;
if ( slot = = 0 )
2018-06-20 15:48:47 +03:00
fixup_low_keys ( path , & disk_key , 1 ) ;
2008-09-23 21:14:14 +04:00
}
btrfs: ctree: check key order before merging tree blocks
[BUG]
With a crafted image, btrfs can panic at btrfs_del_csums():
kernel BUG at fs/btrfs/ctree.c:3188!
invalid opcode: 0000 [#1] SMP PTI
CPU: 0 PID: 1156 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9
RIP: 0010:btrfs_set_item_key_safe+0x16c/0x180
RSP: 0018:ffff976141257ab8 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff898a6b890930 RCX: 0000000004b70000
RDX: 0000000000000000 RSI: ffff976141257bae RDI: ffff976141257acf
RBP: ffff976141257b10 R08: 0000000000001000 R09: ffff9761412579a8
R10: 0000000000000000 R11: 0000000000000000 R12: ffff976141257abe
R13: 0000000000000003 R14: ffff898a6a8be578 R15: ffff976141257bae
FS: 0000000000000000(0000) GS:ffff898a77a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f779d9cd624 CR3: 000000022b2b4006 CR4: 00000000000206f0
Call Trace:
truncate_one_csum+0xac/0xf0
btrfs_del_csums+0x24f/0x3a0
__btrfs_free_extent.isra.72+0x5a7/0xbe0
__btrfs_run_delayed_refs+0x539/0x1120
btrfs_run_delayed_refs+0xdb/0x1b0
btrfs_commit_transaction+0x52/0x950
? start_transaction+0x94/0x450
transaction_kthread+0x163/0x190
kthread+0x105/0x140
? btrfs_cleanup_transaction+0x560/0x560
? kthread_destroy_worker+0x50/0x50
ret_from_fork+0x35/0x40
Modules linked in:
---[ end trace 93bf9db00e6c374e ]---
[CAUSE]
This crafted image has a tricky key order corruption:
checksum tree key (CSUM_TREE ROOT_ITEM 0)
node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
...
key (EXTENT_CSUM EXTENT_CSUM 73785344) block 29757440 gen 19
key (EXTENT_CSUM EXTENT_CSUM 77594624) block 29753344 gen 19
...
leaf 29757440 items 5 free space 150 generation 19 owner CSUM_TREE
item 0 key (EXTENT_CSUM EXTENT_CSUM 73785344) itemoff 2323 itemsize 1672
range start 73785344 end 75497472 length 1712128
item 1 key (EXTENT_CSUM EXTENT_CSUM 75497472) itemoff 2319 itemsize 4
range start 75497472 end 75501568 length 4096
item 2 key (EXTENT_CSUM EXTENT_CSUM 75501568) itemoff 579 itemsize 1740
range start 75501568 end 77283328 length 1781760
item 3 key (EXTENT_CSUM EXTENT_CSUM 77283328) itemoff 575 itemsize 4
range start 77283328 end 77287424 length 4096
item 4 key (EXTENT_CSUM EXTENT_CSUM 4120596480) itemoff 275 itemsize 300 <<<
range start 4120596480 end 4120903680 length 307200
leaf 29753344 items 3 free space 1936 generation 19 owner CSUM_TREE
item 0 key (18446744073457893366 EXTENT_CSUM 77594624) itemoff 2323 itemsize 1672
range start 77594624 end 79306752 length 1712128
...
Note the item 4 key of leaf 29757440, which is obviously too large: it is
even larger than the first key of the next leaf.
However, it still follows the key order within its own tree block, and the
tree checker only validates one tree block at a time, so this kind of
cross-block corruption can't be detected at read time.
[FIX]
The next opportunity to detect such a problem is at tree block merge time,
which is in push_node_left(), balance_node_right(), push_leaf_left() or
push_leaf_right().
Now we check whether the right-most key of the left node is larger than or
equal to the left-most key of the right node.
This way we don't need to call the full tree-checker while still keeping
the key order correct: key order inside each node is already checked by the
tree checker, so we only need to compare the above two slots.
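The boundary check described in [FIX] can be sketched in userspace as below. This is an illustration under stated assumptions and not the kernel implementation: `struct key`, `key_cmp()` and `siblings_out_of_order()` are hypothetical names, and plain arrays of keys stand in for extent buffers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace key, ordered by objectid, then type, then offset. */
struct key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Three-way comparison: negative, zero or positive like memcmp(). */
static int key_cmp(const struct key *a, const struct key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Return true if something is wrong, false if everything is fine.
 * Each block is assumed to be sorted internally already, so only the last
 * key of the left block and the first key of the right block are compared.
 */
static bool siblings_out_of_order(const struct key *left, size_t nr_left,
				  const struct key *right, size_t nr_right)
{
	/* No key to check in one of the blocks. */
	if (nr_left == 0 || nr_right == 0)
		return false;
	return key_cmp(&left[nr_left - 1], &right[0]) >= 0;
}
```

Equal keys across the boundary are reported as an error too, since keys in a tree must be strictly increasing.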
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202833
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-19 09:35:50 +03:00
/*
* Check key order of two sibling extent buffers .
*
* Return true if something is wrong .
* Return false if everything is fine .
*
* Tree - checker only works inside one tree block , thus the following
* corruption can not be detected by tree - checker :
*
* Leaf @ left | Leaf @ right
* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
* | 1 | 2 | 3 | 4 | 5 | f6 | | 7 | 8 |
*
* Key f6 in leaf @ left itself is valid , but not valid when the next
* key in leaf @ right is 7.
* This can only be checked at tree block merge time .
* And since tree checker has ensured all key order in each tree block
* is correct , we only need to check the last key of @ left and the first
* key of @ right .
*/
static bool check_sibling_keys ( struct extent_buffer * left ,
struct extent_buffer * right )
{
struct btrfs_key left_last ;
struct btrfs_key right_first ;
int level = btrfs_header_level ( left ) ;
int nr_left = btrfs_header_nritems ( left ) ;
int nr_right = btrfs_header_nritems ( right ) ;
/* No key to check in one of the tree blocks */
if ( ! nr_left | | ! nr_right )
return false ;
if ( level ) {
btrfs_node_key_to_cpu ( left , & left_last , nr_left - 1 ) ;
btrfs_node_key_to_cpu ( right , & right_first , 0 ) ;
} else {
btrfs_item_key_to_cpu ( left , & left_last , nr_left - 1 ) ;
btrfs_item_key_to_cpu ( right , & right_first , 0 ) ;
}
2023-04-26 13:51:37 +03:00
if ( unlikely ( btrfs_comp_cpu_keys ( & left_last , & right_first ) > = 0 ) ) {
2023-04-26 13:51:36 +03:00
btrfs_crit ( left - > fs_info , " left extent buffer: " ) ;
btrfs_print_tree ( left , false ) ;
btrfs_crit ( left - > fs_info , " right extent buffer: " ) ;
btrfs_print_tree ( right , false ) ;
2020-08-19 09:35:50 +03:00
btrfs_crit ( left - > fs_info ,
" bad key order, sibling blocks, left last (%llu %u %llu) right first (%llu %u %llu) " ,
left_last . objectid , left_last . type ,
left_last . offset , right_first . objectid ,
right_first . type , right_first . offset ) ;
return true ;
}
return false ;
}
2007-02-02 19:05:29 +03:00
/*
* try to push data from one node into the next node left in the
2007-03-01 23:16:26 +03:00
* tree .
2007-03-01 00:35:06 +03:00
*
* returns 0 if some ptrs were pushed left , < 0 if there was some horrible
* error , and > 0 if there was no room in the left hand block .
2007-02-02 19:05:29 +03:00
*/
2008-01-03 18:01:48 +03:00
static int push_node_left ( struct btrfs_trans_handle * trans ,
2016-06-23 01:54:24 +03:00
struct extent_buffer * dst ,
2008-04-24 18:54:32 +04:00
struct extent_buffer * src , int empty )
2007-01-26 23:51:26 +03:00
{
2019-03-20 16:16:45 +03:00
struct btrfs_fs_info * fs_info = trans - > fs_info ;
2007-01-26 23:51:26 +03:00
int push_items = 0 ;
2007-03-01 20:04:21 +03:00
int src_nritems ;
int dst_nritems ;
2007-03-01 00:35:06 +03:00
int ret = 0 ;
2007-01-26 23:51:26 +03:00
2007-10-16 00:14:19 +04:00
src_nritems = btrfs_header_nritems ( src ) ;
dst_nritems = btrfs_header_nritems ( dst ) ;
2016-06-23 01:54:23 +03:00
push_items = BTRFS_NODEPTRS_PER_BLOCK ( fs_info ) - dst_nritems ;
2007-12-11 17:25:06 +03:00
WARN_ON ( btrfs_header_generation ( src ) ! = trans - > transid ) ;
WARN_ON ( btrfs_header_generation ( dst ) ! = trans - > transid ) ;
2007-06-22 22:16:25 +04:00
2008-04-24 22:42:46 +04:00
if ( ! empty & & src_nritems < = 8 )
2008-04-24 18:54:32 +04:00
return 1 ;
2009-01-06 05:25:51 +03:00
if ( push_items < = 0 )
2007-01-26 23:51:26 +03:00
return 1 ;
2008-04-24 22:42:46 +04:00
if ( empty ) {
2008-04-24 18:54:32 +04:00
push_items = min ( src_nritems , push_items ) ;
2008-04-24 22:42:46 +04:00
if ( push_items < src_nritems ) {
/* leave at least 8 pointers in the node if
* we aren ' t going to empty it
*/
if ( src_nritems - push_items < 8 ) {
if ( push_items < = 8 )
return 1 ;
push_items - = 8 ;
}
}
} else
push_items = min ( src_nritems - 8 , push_items ) ;
2007-03-01 23:16:26 +03:00
2020-08-19 09:35:50 +03:00
/* dst is the left eb, src is the middle eb */
if ( check_sibling_keys ( dst , src ) ) {
ret = - EUCLEAN ;
btrfs_abort_transaction ( trans , ret ) ;
return ret ;
}
2021-03-11 17:31:07 +03:00
ret = btrfs_tree_mod_log_eb_copy ( dst , src , dst_nritems , 0 , push_items ) ;
Btrfs: fix tree mod logging
While running the test btrfs/004 from xfstests in a loop, it failed
about 1 time out of 20 runs in my desktop. The failure happened in
the backref walking part of the test, and the test's error message was
like this:
btrfs/004 93s ... [failed, exit status 1] - output mismatch (see /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad)
--- tests/btrfs/004.out 2013-11-26 18:25:29.263333714 +0000
+++ /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad 2013-12-10 15:25:10.327518516 +0000
@@ -1,3 +1,8 @@
QA output created by 004
*** test backref walking
-*** done
+unexpected output from
+ /home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
+expected inum: 405, expected address: 454656, file: /home/fdmanana/btrfs-tests/scratch_1/snap1/p0/d6/d3d/d156/fce, got:
+
...
(Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad' to see the entire diff)
Ran: btrfs/004
Failures: btrfs/004
Failed 1 of 1 tests
But immediately after the test finished, the btrfs inspect-internal command
returned the expected output:
$ btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
inode 405 offset 454656 root 258
inode 405 offset 454656 root 5
It turned out this was because the btrfs_search_old_slot() calls performed
during backref walking (backref.c:__resolve_indirect_ref) were not finding
anything. The reason for this turned out to be that the tree mod logging
code was not logging some node multi-step operations atomically, therefore
btrfs_search_old_slot() callers iterated often over an incomplete tree that
wasn't fully consistent with any tree state from the past. Besides missing
items, this often (but not always) resulted in -EIO errors during old slot
searches, reported in dmesg like this:
[ 4299.933936] ------------[ cut here ]------------
[ 4299.933949] WARNING: CPU: 0 PID: 23190 at fs/btrfs/ctree.c:1343 btrfs_search_old_slot+0x57b/0xab0 [btrfs]()
[ 4299.933950] Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep rfcomm bluetooth parport_pc ppdev binfmt_misc joydev snd_hda_codec_h
[ 4299.933977] CPU: 0 PID: 23190 Comm: btrfs Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70
[ 4299.933978] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 4299.933979] 000000000000053f ffff8806f3fd98f8 ffffffff8176d284 0000000000000007
[ 4299.933982] 0000000000000000 ffff8806f3fd9938 ffffffff8104a81c ffff880659c64b70
[ 4299.933984] ffff880659c643d0 ffff8806599233d8 ffff880701e2e938 0000160000000000
[ 4299.933987] Call Trace:
[ 4299.933991] [<ffffffff8176d284>] dump_stack+0x55/0x76
[ 4299.933994] [<ffffffff8104a81c>] warn_slowpath_common+0x8c/0xc0
[ 4299.933997] [<ffffffff8104a86a>] warn_slowpath_null+0x1a/0x20
[ 4299.934003] [<ffffffffa065d3bb>] btrfs_search_old_slot+0x57b/0xab0 [btrfs]
[ 4299.934005] [<ffffffff81775f3b>] ? _raw_read_unlock+0x2b/0x50
[ 4299.934010] [<ffffffffa0655001>] ? __tree_mod_log_search+0x81/0xc0 [btrfs]
[ 4299.934019] [<ffffffffa06dd9b0>] __resolve_indirect_refs+0x130/0x5f0 [btrfs]
[ 4299.934027] [<ffffffffa06a21f1>] ? free_extent_buffer+0x61/0xc0 [btrfs]
[ 4299.934034] [<ffffffffa06de39c>] find_parent_nodes+0x1fc/0xe40 [btrfs]
[ 4299.934042] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934048] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934056] [<ffffffffa06df980>] iterate_extent_inodes+0xe0/0x250 [btrfs]
[ 4299.934058] [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50
[ 4299.934065] [<ffffffffa06dfb82>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
[ 4299.934071] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934078] [<ffffffffa06b7015>] btrfs_ioctl+0xf65/0x1f60 [btrfs]
[ 4299.934080] [<ffffffff811658b8>] ? handle_mm_fault+0x278/0xb00
[ 4299.934083] [<ffffffff81075563>] ? up_read+0x23/0x40
[ 4299.934085] [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0
[ 4299.934088] [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570
[ 4299.934090] [<ffffffff81776e23>] ? error_sti+0x5/0x6
[ 4299.934093] [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 4299.934096] [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13
[ 4299.934098] [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0
[ 4299.934100] [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934104] ---[ end trace 48f0cfc902491414 ]---
[ 4299.934378] btrfs bad fsid on block 0
These tree mod log operations that must be performed atomically, tree_mod_log_free_eb,
tree_mod_log_eb_copy, tree_mod_log_insert_root and tree_mod_log_insert_move, used to
be performed atomically before the following commit:
c8cc6341653721b54760480b0d0d9b5f09b46741
(Btrfs: stop using GFP_ATOMIC for the tree mod log allocations)
That change removed the atomicity of such operations. This patch restores the
atomicity while still not doing the GFP_ATOMIC allocations of tree_mod_elem
structures, so it has to do the allocations using GFP_NOFS before acquiring
the mod log lock.
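The allocate-first, publish-under-lock pattern described above could look roughly like this in user space. It is a sketch under stated assumptions, not the kernel code: toy_elem, toy_log_lock and toy_log_group() are hypothetical names, malloc() stands in for a GFP_NOFS allocation, and a pthread mutex stands in for the mod log lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical log element; the real tree_mod_elem carries more state. */
struct toy_elem {
	int slot;
	struct toy_elem *next;
};

static pthread_mutex_t toy_log_lock = PTHREAD_MUTEX_INITIALIZER;
static struct toy_elem *toy_log_head;
static int toy_log_len;

/* Log nr slots as one group; returns 0 on success, -1 if allocation fails. */
static int toy_log_group(int first_slot, int nr)
{
	struct toy_elem **elems;
	int i;

	assert(nr > 0);

	/* Step 1: allocate everything up front, without holding the lock. */
	elems = calloc(nr, sizeof(*elems));
	if (!elems)
		return -1;
	for (i = 0; i < nr; i++) {
		elems[i] = malloc(sizeof(*elems[i]));
		if (!elems[i])
			goto free_all;
		elems[i]->slot = first_slot + i;
	}

	/* Step 2: publish the whole group under the lock; no allocation here. */
	pthread_mutex_lock(&toy_log_lock);
	for (i = 0; i < nr; i++) {
		elems[i]->next = toy_log_head;
		toy_log_head = elems[i];
		toy_log_len++;
	}
	pthread_mutex_unlock(&toy_log_lock);

	free(elems);
	return 0;

free_all:
	/* Nothing was published yet, so readers never see a partial group. */
	while (i-- > 0)
		free(elems[i]);
	free(elems);
	return -1;
}
```

A reader that takes the same lock sees either none or all of a group's elements, which is the atomicity the patch restores.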
This issue has been experienced by several users recently, for example:
http://www.spinics.net/lists/linux-btrfs/msg28574.html
After running the btrfs/004 test for 679 consecutive iterations with this
patch applied, I didn't run into the issue anymore.
Cc: stable@vger.kernel.org
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2013-12-20 19:17:46 +04:00
	if (ret) {
2016-06-11 01:19:25 +03:00
		btrfs_abort_transaction(trans, ret);
2013-12-20 19:17:46 +04:00
		return ret;
	}
2007-10-16 00:14:19 +04:00
	copy_extent_buffer(dst, src,
2022-11-15 19:16:16 +03:00
			   btrfs_node_key_ptr_offset(dst, dst_nritems),
			   btrfs_node_key_ptr_offset(src, 0),
2009-01-06 05:25:51 +03:00
			   push_items * sizeof(struct btrfs_key_ptr));
2007-10-16 00:14:19 +04:00
2007-03-01 20:04:21 +03:00
	if (push_items < src_nritems) {
2012-10-19 11:22:03 +04:00
		/*
btrfs: insert tree mod log move in push_node_left
There is a fairly unlikely race condition in tree mod log rewind that
can result in a kernel panic which has the following trace:
[530.569] BTRFS critical (device sda3): unable to find logical 0 length 4096
[530.585] BTRFS critical (device sda3): unable to find logical 0 length 4096
[530.602] BUG: kernel NULL pointer dereference, address: 0000000000000002
[530.618] #PF: supervisor read access in kernel mode
[530.629] #PF: error_code(0x0000) - not-present page
[530.641] PGD 0 P4D 0
[530.647] Oops: 0000 [#1] SMP
[530.654] CPU: 30 PID: 398973 Comm: below Kdump: loaded Tainted: G S O K 5.12.0-0_fbk13_clang_7455_gb24de3bdb045 #1
[530.680] Hardware name: Quanta Mono Lake-M.2 SATA 1HY9U9Z001G/Mono Lake-M.2 SATA, BIOS F20_3A15 08/16/2017
[530.703] RIP: 0010:__btrfs_map_block+0xaa/0xd00
[530.755] RSP: 0018:ffffc9002c2f7600 EFLAGS: 00010246
[530.767] RAX: ffffffffffffffea RBX: ffff888292e41000 RCX: f2702d8b8be15100
[530.784] RDX: ffff88885fda6fb8 RSI: ffff88885fd973c8 RDI: ffff88885fd973c8
[530.800] RBP: ffff888292e410d0 R08: ffffffff82fd7fd0 R09: 00000000fffeffff
[530.816] R10: ffffffff82e57fd0 R11: ffffffff82e57d70 R12: 0000000000000000
[530.832] R13: 0000000000001000 R14: 0000000000001000 R15: ffffc9002c2f76f0
[530.848] FS: 00007f38d64af000(0000) GS:ffff88885fd80000(0000) knlGS:0000000000000000
[530.866] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[530.880] CR2: 0000000000000002 CR3: 00000002b6770004 CR4: 00000000003706e0
[530.896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[530.912] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[530.928] Call Trace:
[530.934] ? btrfs_printk+0x13b/0x18c
[530.943] ? btrfs_bio_counter_inc_blocked+0x3d/0x130
[530.955] btrfs_map_bio+0x75/0x330
[530.963] ? kmem_cache_alloc+0x12a/0x2d0
[530.973] ? btrfs_submit_metadata_bio+0x63/0x100
[530.984] btrfs_submit_metadata_bio+0xa4/0x100
[530.995] submit_extent_page+0x30f/0x360
[531.004] read_extent_buffer_pages+0x49e/0x6d0
[531.015] ? submit_extent_page+0x360/0x360
[531.025] btree_read_extent_buffer_pages+0x5f/0x150
[531.037] read_tree_block+0x37/0x60
[531.046] read_block_for_search+0x18b/0x410
[531.056] btrfs_search_old_slot+0x198/0x2f0
[531.066] resolve_indirect_ref+0xfe/0x6f0
[531.076] ? ulist_alloc+0x31/0x60
[531.084] ? kmem_cache_alloc_trace+0x12e/0x2b0
[531.095] find_parent_nodes+0x720/0x1830
[531.105] ? ulist_alloc+0x10/0x60
[531.113] iterate_extent_inodes+0xea/0x370
[531.123] ? btrfs_previous_extent_item+0x8f/0x110
[531.134] ? btrfs_search_path_in_tree+0x240/0x240
[531.146] iterate_inodes_from_logical+0x98/0xd0
[531.157] ? btrfs_search_path_in_tree+0x240/0x240
[531.168] btrfs_ioctl_logical_to_ino+0xd9/0x180
[531.179] btrfs_ioctl+0xe2/0x2eb0
This occurs when logical inode resolution takes a tree mod log sequence
number, and then while backref walking hits a rewind on a busy node
which has the following sequence of tree mod log operations (numbers
filled in from a specific example, but they are somewhat arbitrary)
REMOVE_WHILE_FREEING slot 532
REMOVE_WHILE_FREEING slot 531
REMOVE_WHILE_FREEING slot 530
...
REMOVE_WHILE_FREEING slot 0
REMOVE slot 455
REMOVE slot 454
REMOVE slot 453
...
REMOVE slot 0
ADD slot 455
ADD slot 454
ADD slot 453
...
ADD slot 0
MOVE src slot 0 -> dst slot 456 nritems 533
REMOVE slot 455
REMOVE slot 454
REMOVE slot 453
...
REMOVE slot 0
When this sequence gets applied via btrfs_tree_mod_log_rewind, it
allocates a fresh rewind eb, and first inserts the correct key info for
the 533 elements, then overwrites the first 456 of them, then decrements
the count by 456 via the add ops, then rewinds the move by doing a
memmove from 456:988->0:532. We have never written anything past 532, so
that memmove writes garbage into the 0:532 range. In practice, this
results in a lot of fully 0 keys. The rewind then puts valid keys into
slots 0:455 with the last removes, but 456:532 are still invalid.
When search_old_slot uses this eb, if it uses one of those invalid
slots, it can then read the extent buffer and issue a bio for offset 0
which ultimately panics looking up extent mappings.
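The rewind steps described above can be replayed on a plain array to see where the invalid slots come from. This is a toy model, not btrfs_tree_mod_log_rewind: toy_bad_rewind(), the TOY_* constants, the 1024-slot buffer and the key values are assumptions, and a zero value stands in for a fully zeroed key.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_SLOTS 1024	/* capacity of the toy rewind buffer (assumption) */
#define TOY_NR    533	/* items in the node, from the example above */
#define TOY_SHIFT 456	/* distance of the logged MOVE, from the example above */

/* Returns how many of the slots 456..532 end up holding a zero key. */
static int toy_bad_rewind(void)
{
	uint64_t eb[TOY_SLOTS];
	int invalid = 0;
	int i;

	assert(TOY_SHIFT + TOY_NR <= TOY_SLOTS);
	memset(eb, 0, sizeof(eb));	/* fresh rewind eb, nothing written yet */

	/* Rewind REMOVE_WHILE_FREEING 532..0: key info for all 533 items. */
	for (i = 0; i < TOY_NR; i++)
		eb[i] = 1000 + i;

	/* Rewind REMOVE 455..0: overwrite the first 456 slots. */
	for (i = 0; i < TOY_SHIFT; i++)
		eb[i] = 2000 + i;

	/* Rewind ADD 455..0: only the item count shrinks, contents stay. */

	/* Rewind MOVE 0 -> 456: copy 456..988 back to 0..532. */
	memmove(&eb[0], &eb[TOY_SHIFT], TOY_NR * sizeof(eb[0]));

	/* Rewind the last REMOVE 455..0: slots 0..455 become valid again. */
	for (i = 0; i < TOY_SHIFT; i++)
		eb[i] = 3000 + i;

	for (i = TOY_SHIFT; i < TOY_NR; i++)
		if (eb[i] == 0)
			invalid++;
	return invalid;
}
```

In this model all 77 slots in 456..532 come back zero, because the memmove sources them from 912..988, which were never written.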
This bad tree mod log sequence gets generated when the node balancing
code happens to do a balance_node_right followed by a push_node_left
while logging in the tree mod log. Illustrated for ebs L and R (left and
right):
            L                 R
start:
      [XXX|YYY|...]      [ZZZ|...|...]
balance_node_right:
      [XXX|YYY|...]      [...|ZZZ|...]  move Z to make room for Y
      [XXX|...|...]      [YYY|ZZZ|...]  copy Y from L to R
push_node_left:
      [XXX|YYY|...]      [...|ZZZ|...]  copy Y from R to L
      [XXX|YYY|...]      [ZZZ|...|...]  move Z into emptied space (NOT LOGGED!)
This is because balance_node_right logs a move, but push_node_left
explicitly doesn't. That is because logging the move would remove the
overwritten src < dst range in the right eb, which was already logged
when we called btrfs_tree_mod_log_eb_copy. The correct sequence would
include a move from 456:988 to 0:532 after remove 0:455 and before
removing 0:532. Reversing that sequence would entail creating keys for
0:532, then moving those keys out to 456:988, then creating more keys
for 0:455.
i.e.,
REMOVE_WHILE_FREEING slot 532
REMOVE_WHILE_FREEING slot 531
REMOVE_WHILE_FREEING slot 530
...
REMOVE_WHILE_FREEING slot 0
MOVE src slot 456 -> dst slot 0 nritems 533
REMOVE slot 455
REMOVE slot 454
REMOVE slot 453
...
REMOVE slot 0
ADD slot 455
ADD slot 454
ADD slot 453
...
ADD slot 0
MOVE src slot 0 -> dst slot 456 nritems 533
REMOVE slot 455
REMOVE slot 454
REMOVE slot 453
...
REMOVE slot 0
Fix this to log the move but avoid the double remove by putting all the
logging logic in btrfs_tree_mod_log_eb_copy which has enough information
to detect these cases and properly log moves, removes, and adds. Leave
btrfs_tree_mod_log_insert_move to handle insert_ptr and delete_ptr's
tree mod logging.
(Un)fortunately, this is quite difficult to reproduce, and I was only
able to reproduce it by adding sleeps in btrfs_search_old_slot that
would encourage more log rewinding during logical_to_ino ioctls. I was
able to hit the warning in the previous patch in the series without the
fix quite quickly, but not after this patch.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-01 21:55:14 +03:00
		 * btrfs_tree_mod_log_eb_copy handles logging the move, so we
		 * don't need to do an explicit tree mod log operation for it.
2012-10-19 11:22:03 +04:00
		 */
2022-11-15 19:16:16 +03:00
		memmove_extent_buffer(src, btrfs_node_key_ptr_offset(src, 0),
				      btrfs_node_key_ptr_offset(src, push_items),
2007-10-16 00:14:19 +04:00
				      (src_nritems - push_items) *
				      sizeof(struct btrfs_key_ptr));
	}
	btrfs_set_header_nritems(src, src_nritems - push_items);
	btrfs_set_header_nritems(dst, dst_nritems + push_items);
	btrfs_mark_buffer_dirty(src);
	btrfs_mark_buffer_dirty(dst);
2008-09-23 21:14:14 +04:00
2007-03-01 23:16:26 +03:00
	return ret;
}
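The copy-then-compact tail of push_node_left can be mirrored on plain arrays. The following is a minimal user-space sketch, not the kernel implementation: struct toy_node, TOY_CAP and toy_push_left() are hypothetical, int entries stand in for struct btrfs_key_ptr, and tree mod logging, locking and error handling are left out.

```c
#include <assert.h>
#include <string.h>

#define TOY_CAP 8	/* hypothetical node capacity */

struct toy_node {
	int nritems;
	int ptrs[TOY_CAP];
};

/*
 * Append the first push_items entries of src to the end of dst, then
 * slide the remaining src entries down to slot 0.
 */
static void toy_push_left(struct toy_node *dst, struct toy_node *src,
			  int push_items)
{
	assert(push_items > 0 && push_items <= src->nritems);
	assert(dst->nritems + push_items <= TOY_CAP);

	/* Counterpart of copy_extent_buffer(): src[0..push) -> dst tail. */
	memcpy(&dst->ptrs[dst->nritems], &src->ptrs[0],
	       push_items * sizeof(src->ptrs[0]));

	/* Counterpart of memmove_extent_buffer(): close the gap in src. */
	if (push_items < src->nritems)
		memmove(&src->ptrs[0], &src->ptrs[push_items],
			(src->nritems - push_items) * sizeof(src->ptrs[0]));

	src->nritems -= push_items;
	dst->nritems += push_items;
}
```

The memmove is skipped when src is emptied completely, matching the push_items < src_nritems guard in the function above.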
/*
 * try to push data from one node into the next node right in the
 * tree.
 *
 * returns 0 if some ptrs were pushed, < 0 if there was some horrible
 * error, and > 0 if there was no room in the right hand block.
 *
 * this will only push up to 1/2 the contents of the left node over
 */
2007-10-16 00:14:19 +04:00
static int balance_node_right(struct btrfs_trans_handle *trans,
			      struct extent_buffer *dst,
			      struct extent_buffer *src)
2007-03-01 23:16:26 +03:00
{
2019-03-20 16:18:06 +03:00
	struct btrfs_fs_info *fs_info = trans->fs_info;
2007-03-01 23:16:26 +03:00
	int push_items = 0;
	int max_push;
	int src_nritems;
	int dst_nritems;
	int ret = 0;
2007-12-11 17:25:06 +03:00
	WARN_ON(btrfs_header_generation(src) != trans->transid);
	WARN_ON(btrfs_header_generation(dst) != trans->transid);
2007-10-16 00:14:19 +04:00
	src_nritems = btrfs_header_nritems(src);
	dst_nritems = btrfs_header_nritems(dst);
2016-06-23 01:54:23 +03:00
	push_items = BTRFS_NODEPTRS_PER_BLOCK(fs_info) - dst_nritems;
2009-01-06 05:25:51 +03:00
	if (push_items <= 0)
2007-03-01 23:16:26 +03:00
		return 1;
2008-04-24 22:42:46 +04:00
2009-01-06 05:25:51 +03:00
	if (src_nritems < 4)
2008-04-24 22:42:46 +04:00
		return 1;
2007-03-01 23:16:26 +03:00
	max_push = src_nritems / 2 + 1;
	/* don't try to empty the node */
2009-01-06 05:25:51 +03:00
	if (max_push >= src_nritems)
2007-03-01 23:16:26 +03:00
		return 1;
2007-08-29 17:11:44 +04:00
2007-03-01 23:16:26 +03:00
	if (max_push < push_items)
		push_items = max_push;
2020-08-19 09:35:50 +03:00
	/* dst is the right eb, src is the middle eb */
	if (check_sibling_keys(src, dst)) {
		ret = -EUCLEAN;
		btrfs_abort_transaction(trans, ret);
		return ret;
	}
2023-06-01 21:55:14 +03:00
	/*
	 * btrfs_tree_mod_log_eb_copy handles logging the move, so we don't
	 * need to do an explicit tree mod log operation for it.
	 */
2022-11-15 19:16:16 +03:00
	memmove_extent_buffer(dst, btrfs_node_key_ptr_offset(dst, push_items),
			      btrfs_node_key_ptr_offset(dst, 0),
2007-10-16 00:14:19 +04:00
			      (dst_nritems) *
			      sizeof(struct btrfs_key_ptr));
2007-03-30 22:27:56 +04:00
2021-03-11 17:31:07 +03:00
	ret = btrfs_tree_mod_log_eb_copy(dst, src, 0, src_nritems - push_items,
					 push_items);
Btrfs: fix tree mod logging
While running the test btrfs/004 from xfstests in a loop, it failed
about 1 time out of 20 runs in my desktop. The failure happened in
the backref walking part of the test, and the test's error message was
like this:
btrfs/004 93s ... [failed, exit status 1] - output mismatch (see /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad)
--- tests/btrfs/004.out 2013-11-26 18:25:29.263333714 +0000
+++ /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad 2013-12-10 15:25:10.327518516 +0000
@@ -1,3 +1,8 @@
QA output created by 004
*** test backref walking
-*** done
+unexpected output from
+ /home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
+expected inum: 405, expected address: 454656, file: /home/fdmanana/btrfs-tests/scratch_1/snap1/p0/d6/d3d/d156/fce, got:
+
...
(Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad' to see the entire diff)
Ran: btrfs/004
Failures: btrfs/004
Failed 1 of 1 tests
But immediately after the test finished, the btrfs inspect-internal command
returned the expected output:
$ btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
inode 405 offset 454656 root 258
inode 405 offset 454656 root 5
It turned out this was because the btrfs_search_old_slot() calls performed
during backref walking (backref.c:__resolve_indirect_ref) were not finding
anything. The reason for this turned out to be that the tree mod logging
code was not logging some node multi-step operations atomically, therefore
btrfs_search_old_slot() callers iterated often over an incomplete tree that
wasn't fully consistent with any tree state from the past. Besides missing
items, this often (but not always) resulted in -EIO errors during old slot
searches, reported in dmesg like this:
[ 4299.933936] ------------[ cut here ]------------
[ 4299.933949] WARNING: CPU: 0 PID: 23190 at fs/btrfs/ctree.c:1343 btrfs_search_old_slot+0x57b/0xab0 [btrfs]()
[ 4299.933950] Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep rfcomm bluetooth parport_pc ppdev binfmt_misc joydev snd_hda_codec_h
[ 4299.933977] CPU: 0 PID: 23190 Comm: btrfs Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70
[ 4299.933978] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 4299.933979] 000000000000053f ffff8806f3fd98f8 ffffffff8176d284 0000000000000007
[ 4299.933982] 0000000000000000 ffff8806f3fd9938 ffffffff8104a81c ffff880659c64b70
[ 4299.933984] ffff880659c643d0 ffff8806599233d8 ffff880701e2e938 0000160000000000
[ 4299.933987] Call Trace:
[ 4299.933991] [<ffffffff8176d284>] dump_stack+0x55/0x76
[ 4299.933994] [<ffffffff8104a81c>] warn_slowpath_common+0x8c/0xc0
[ 4299.933997] [<ffffffff8104a86a>] warn_slowpath_null+0x1a/0x20
[ 4299.934003] [<ffffffffa065d3bb>] btrfs_search_old_slot+0x57b/0xab0 [btrfs]
[ 4299.934005] [<ffffffff81775f3b>] ? _raw_read_unlock+0x2b/0x50
[ 4299.934010] [<ffffffffa0655001>] ? __tree_mod_log_search+0x81/0xc0 [btrfs]
[ 4299.934019] [<ffffffffa06dd9b0>] __resolve_indirect_refs+0x130/0x5f0 [btrfs]
[ 4299.934027] [<ffffffffa06a21f1>] ? free_extent_buffer+0x61/0xc0 [btrfs]
[ 4299.934034] [<ffffffffa06de39c>] find_parent_nodes+0x1fc/0xe40 [btrfs]
[ 4299.934042] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934048] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934056] [<ffffffffa06df980>] iterate_extent_inodes+0xe0/0x250 [btrfs]
[ 4299.934058] [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50
[ 4299.934065] [<ffffffffa06dfb82>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
[ 4299.934071] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934078] [<ffffffffa06b7015>] btrfs_ioctl+0xf65/0x1f60 [btrfs]
[ 4299.934080] [<ffffffff811658b8>] ? handle_mm_fault+0x278/0xb00
[ 4299.934083] [<ffffffff81075563>] ? up_read+0x23/0x40
[ 4299.934085] [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0
[ 4299.934088] [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570
[ 4299.934090] [<ffffffff81776e23>] ? error_sti+0x5/0x6
[ 4299.934093] [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 4299.934096] [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13
[ 4299.934098] [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0
[ 4299.934100] [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934104] ---[ end trace 48f0cfc902491414 ]---
[ 4299.934378] btrfs bad fsid on block 0
The tree mod log operations that must be performed atomically (tree_mod_log_free_eb,
tree_mod_log_eb_copy, tree_mod_log_insert_root and tree_mod_log_insert_move) were
performed atomically before the following commit:
c8cc6341653721b54760480b0d0d9b5f09b46741
(Btrfs: stop using GFP_ATOMIC for the tree mod log allocations)
That change removed the atomicity of such operations. This patch restores the
atomicity while still not doing the GFP_ATOMIC allocations of tree_mod_elem
structures, so it has to do the allocations using GFP_NOFS before acquiring
the mod log lock.
Several users have hit this issue recently, for example:
http://www.spinics.net/lists/linux-btrfs/msg28574.html
After running the btrfs/004 test for 679 consecutive iterations with this
patch applied, I didn't run into the issue anymore.
Cc: stable@vger.kernel.org
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
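The fix described above (allocate every log element before taking the lock, then publish all of them inside one critical section) can be sketched in userspace C. This is a minimal illustration, not the kernel code: `toy_elem`, `toy_log` and `toy_log_free_slots` are hypothetical names, `malloc` stands in for a GFP_NOFS allocation that may sleep, and a pthread mutex stands in for the mod log lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a tree_mod_elem: one logged slot removal. */
struct toy_elem {
	int slot;
	unsigned long seq;
	struct toy_elem *next;
};

/* Hypothetical stand-in for the tree mod log and its lock. */
struct toy_log {
	pthread_mutex_t lock;
	unsigned long seq;
	struct toy_elem *head;
	int count;
};

/*
 * Log the removal of nr slots as one atomic group: a reader that takes
 * the lock sees either none of the nr elements or all of them.
 */
static int toy_log_free_slots(struct toy_log *log, int nr)
{
	struct toy_elem **elems;
	int i;

	if (nr <= 0)
		return 0;

	elems = calloc(nr, sizeof(*elems));
	if (!elems)
		return -1;

	/* Step 1: allocate everything with no lock held (may block). */
	for (i = 0; i < nr; i++) {
		elems[i] = malloc(sizeof(**elems));
		if (!elems[i])
			goto fail;
	}

	/* Step 2: publish all elements inside a single critical section. */
	pthread_mutex_lock(&log->lock);
	for (i = 0; i < nr; i++) {
		elems[i]->slot = i;
		elems[i]->seq = ++log->seq;
		elems[i]->next = log->head;
		log->head = elems[i];
		log->count++;
	}
	pthread_mutex_unlock(&log->lock);

	free(elems);
	return 0;

fail:
	/* Nothing was published yet, so the log is left untouched. */
	while (i-- > 0)
		free(elems[i]);
	free(elems);
	return -1;
}
```

Because allocation failures are handled before the lock is taken, the critical section itself cannot fail halfway and leave a partially logged operation behind.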
2013-12-20 19:17:46 +04:00
	if (ret) {
2016-06-11 01:19:25 +03:00
		btrfs_abort_transaction(trans, ret);
2013-12-20 19:17:46 +04:00
		return ret;
	}
2007-10-16 00:14:19 +04:00
	copy_extent_buffer(dst, src,
2022-11-15 19:16:16 +03:00
			   btrfs_node_key_ptr_offset(dst, 0),
			   btrfs_node_key_ptr_offset(src, src_nritems - push_items),
2009-01-06 05:25:51 +03:00
			   push_items * sizeof(struct btrfs_key_ptr));
2007-03-01 23:16:26 +03:00
2007-10-16 00:14:19 +04:00
	btrfs_set_header_nritems(src, src_nritems - push_items);
	btrfs_set_header_nritems(dst, dst_nritems + push_items);
2007-03-01 23:16:26 +03:00
2007-10-16 00:14:19 +04:00
	btrfs_mark_buffer_dirty(src);
	btrfs_mark_buffer_dirty(dst);
2008-09-23 21:14:14 +04:00
2007-03-01 00:35:06 +03:00
	return ret;
2007-01-26 23:51:26 +03:00
}
2007-02-24 21:39:08 +03:00
/*
 * helper function to insert a new root level in the tree.
 * A new node is allocated, and a single item is inserted to
 * point to the existing root
2007-03-01 00:35:06 +03:00
 *
 * returns zero on success or < 0 on failure.
2007-02-24 21:39:08 +03:00
 */
2009-01-06 05:25:51 +03:00
static noinline int insert_new_root(struct btrfs_trans_handle *trans,
2007-10-16 00:14:19 +04:00
				    struct btrfs_root *root,
2013-05-22 16:06:51 +04:00
				    struct btrfs_path *path, int level)
2007-02-22 19:39:13 +03:00
{
2007-12-11 17:25:06 +03:00
	u64 lower_gen;
2007-10-16 00:14:19 +04:00
	struct extent_buffer *lower;
	struct extent_buffer *c;
2008-06-26 00:01:30 +04:00
	struct extent_buffer *old;
2007-10-16 00:14:19 +04:00
	struct btrfs_disk_key lower_key;
2018-03-05 18:35:29 +03:00
	int ret;
2007-02-22 19:39:13 +03:00
	BUG_ON(path->nodes[level]);
	BUG_ON(path->nodes[level - 1] != root->node);
2007-12-11 17:25:06 +03:00
	lower = path->nodes[level - 1];
	if (level == 1)
		btrfs_item_key(lower, &lower_key, 0);
	else
		btrfs_node_key(lower, &lower_key, 0);
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by and large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 16:43:06 +03:00
	c = btrfs_alloc_tree_block(trans, root, 0, root->root_key.objectid,
				   &lower_key, level, root->node->start, 0,
				   BTRFS_NESTING_NEW_ROOT);
2007-10-16 00:14:19 +04:00
	if (IS_ERR(c))
		return PTR_ERR(c);
2008-06-26 00:01:30 +04:00
2023-09-08 02:09:33 +03:00
	root_add_used_bytes(root);
2010-05-16 18:46:25 +04:00
2007-10-16 00:14:19 +04:00
	btrfs_set_header_nritems(c, 1);
	btrfs_set_node_key(c, &lower_key, 0);
2007-10-16 00:15:53 +04:00
	btrfs_set_node_blockptr(c, 0, lower->start);
2007-12-11 17:25:06 +03:00
	lower_gen = btrfs_header_generation(lower);
2008-09-23 21:14:14 +04:00
	WARN_ON(lower_gen != trans->transid);
2007-12-11 17:25:06 +03:00
	btrfs_set_node_ptr_generation(c, 0, lower_gen);
2007-03-23 17:01:08 +03:00
2007-10-16 00:14:19 +04:00
	btrfs_mark_buffer_dirty(c);
2007-03-23 17:01:08 +03:00
2008-06-26 00:01:30 +04:00
	old = root->node;
2021-03-11 17:31:08 +03:00
	ret = btrfs_tree_mod_log_insert_root(root->node, c, false);
2023-06-08 13:27:47 +03:00
	if (ret < 0) {
		btrfs_free_tree_block(trans, btrfs_root_id(root), c, 0, 1);
		btrfs_tree_unlock(c);
		free_extent_buffer(c);
		return ret;
	}
2011-03-23 21:54:42 +03:00
	rcu_assign_pointer(root->node, c);
2008-06-26 00:01:30 +04:00
	/* the super has an extra ref to root->node */
	free_extent_buffer(old);
2008-03-24 22:01:56 +03:00
	add_root_to_dirty_list(root);
2019-10-08 14:28:47 +03:00
	atomic_inc(&c->refs);
2007-10-16 00:14:19 +04:00
	path->nodes[level] = c;
2020-08-20 18:46:10 +03:00
	path->locks[level] = BTRFS_WRITE_LOCK;
2007-02-22 19:39:13 +03:00
	path->slots[level] = 0;
	return 0;
}
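The core idea of insert_new_root() (allocate a node one level up, give it a single pointer keyed by the old root's first key, then switch the root pointer) can be shown with a toy in-memory tree. This is only a sketch under simplifying assumptions: `grow_node`, `grow_tree` and `grow_insert_new_root` are hypothetical names, and locking, reference counting, the tree mod log and transaction handling are all left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GROW_MAX_PTRS 4	/* hypothetical fan-out for the illustration */

struct grow_node {
	int level;				/* 0 for a leaf */
	int nritems;
	uint64_t keys[GROW_MAX_PTRS];
	struct grow_node *children[GROW_MAX_PTRS];
};

struct grow_tree {
	struct grow_node *root;
};

/*
 * Add one level to the tree: the new root holds exactly one pointer,
 * keyed by the first key of the old root, and points to the old root.
 * Returns 0 on success or -1 on failure.
 */
static int grow_insert_new_root(struct grow_tree *tree)
{
	struct grow_node *old = tree->root;
	struct grow_node *c;

	/* An empty node has no first key to copy into the parent. */
	if (!old || old->nritems == 0)
		return -1;

	c = calloc(1, sizeof(*c));
	if (!c)
		return -1;

	c->level = old->level + 1;
	c->nritems = 1;
	c->keys[0] = old->keys[0];
	c->children[0] = old;

	/* The old root stays in the tree as an ordinary node. */
	tree->root = c;
	return 0;
}
```

After the call, the old root is still reachable through slot 0 of the new root, which is why a subsequent split only needs to move half of its entries.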
2007-02-02 19:05:29 +03:00
/*
 * worker function to insert a single pointer in a node.
 * the node should have enough room for the pointer already
2007-02-24 21:39:08 +03:00
 *
2007-02-02 19:05:29 +03:00
 * slot and level indicate where you want the key to go, and
 * bytenr is the block the key points to.
 */
2023-06-08 13:27:48 +03:00
static int insert_ptr(struct btrfs_trans_handle *trans,
		      struct btrfs_path *path,
		      struct btrfs_disk_key *key, u64 bytenr,
		      int slot, int level)
2007-02-02 19:05:29 +03:00
{
2007-10-16 00:14:19 +04:00
	struct extent_buffer *lower;
2007-02-02 19:05:29 +03:00
	int nritems;
2012-05-26 13:45:21 +04:00
	int ret;
2007-02-22 19:39:13 +03:00
	BUG_ON(!path->nodes[level]);
2021-09-22 12:36:45 +03:00
	btrfs_assert_tree_write_locked(path->nodes[level]);
2007-10-16 00:14:19 +04:00
	lower = path->nodes[level];
	nritems = btrfs_header_nritems(lower);
2009-04-03 01:05:11 +04:00
	BUG_ON(slot > nritems);
2019-03-20 16:32:45 +03:00
	BUG_ON(nritems == BTRFS_NODEPTRS_PER_BLOCK(trans->fs_info));
2007-02-02 19:05:29 +03:00
	if (slot != nritems) {
2018-03-05 17:47:39 +03:00
		if (level) {
2021-03-11 17:31:07 +03:00
			ret = btrfs_tree_mod_log_insert_move(lower, slot + 1,
					slot, nritems - slot);
2023-06-08 13:27:48 +03:00
			if (ret < 0) {
				btrfs_abort_transaction(trans, ret);
				return ret;
			}
2018-03-05 17:47:39 +03:00
		}
2007-10-16 00:14:19 +04:00
		memmove_extent_buffer(lower,
2022-11-15 19:16:16 +03:00
			      btrfs_node_key_ptr_offset(lower, slot + 1),
			      btrfs_node_key_ptr_offset(lower, slot),
2007-03-30 22:27:56 +04:00
			      (nritems - slot) * sizeof(struct btrfs_key_ptr));
2007-02-02 19:05:29 +03:00
	}
2012-06-21 13:01:06 +04:00
	if (level) {
2021-03-11 17:31:07 +03:00
		ret = btrfs_tree_mod_log_insert_key(lower, slot,
2022-10-14 16:44:33 +03:00
						    BTRFS_MOD_LOG_KEY_ADD);
2023-06-08 13:27:48 +03:00
		if (ret < 0) {
			btrfs_abort_transaction(trans, ret);
			return ret;
		}
2012-05-26 13:45:21 +04:00
	}
2007-10-16 00:14:19 +04:00
	btrfs_set_node_key(lower, key, slot);
2007-10-16 00:15:53 +04:00
	btrfs_set_node_blockptr(lower, slot, bytenr);
2007-12-11 17:25:06 +03:00
	WARN_ON(trans->transid == 0);
	btrfs_set_node_ptr_generation(lower, slot, trans->transid);
2007-10-16 00:14:19 +04:00
	btrfs_set_header_nritems(lower, nritems + 1);
	btrfs_mark_buffer_dirty(lower);
2023-06-08 13:27:48 +03:00
	return 0;
2007-02-02 19:05:29 +03:00
}
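The array manipulation at the heart of insert_ptr() (shift the entries from slot onwards one position to the right, then write the new key pointer into the gap) can be reproduced on a plain in-memory array. The sketch below is an illustration only: `slot_key_ptr`, `slot_node`, `SLOT_MAX_PTRS` and `slot_insert_ptr` are hypothetical names, the capacity is arbitrary, and the tree mod logging and transaction abort paths are omitted. Where the kernel uses BUG_ON() for invalid arguments, this version returns -1.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLOT_MAX_PTRS 8	/* hypothetical; stands in for BTRFS_NODEPTRS_PER_BLOCK() */

struct slot_key_ptr {
	uint64_t key;
	uint64_t blockptr;
};

struct slot_node {
	int nritems;
	struct slot_key_ptr ptrs[SLOT_MAX_PTRS];
};

/*
 * Insert (key, blockptr) at position slot. The node must already have
 * room for one more pointer. Returns 0 on success, -1 on bad arguments.
 */
static int slot_insert_ptr(struct slot_node *node, uint64_t key,
			   uint64_t blockptr, int slot)
{
	if (slot < 0 || slot > node->nritems ||
	    node->nritems == SLOT_MAX_PTRS)
		return -1;

	/* Open a gap unless we are appending at the end. */
	if (slot != node->nritems)
		memmove(&node->ptrs[slot + 1], &node->ptrs[slot],
			(node->nritems - slot) * sizeof(struct slot_key_ptr));

	node->ptrs[slot].key = key;
	node->ptrs[slot].blockptr = blockptr;
	node->nritems++;
	return 0;
}
```

memmove() is used rather than memcpy() because the source and destination ranges overlap, which matches the kernel's choice of memmove_extent_buffer() over copy_extent_buffer() here.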
2007-02-24 21:39:08 +03:00
/*
 * split the node at the specified level in path in two.
 * The path is corrected to point to the appropriate node after the split
 *
 * Before splitting this tries to make some room in the node by pushing
 * left and right, if either one works, it returns right away.
2007-03-01 00:35:06 +03:00
 *
 * returns 0 on success and < 0 on failure
2007-02-24 21:39:08 +03:00
 */
2008-09-06 00:13:11 +04:00
static noinline int split_node(struct btrfs_trans_handle *trans,
			       struct btrfs_root *root,
			       struct btrfs_path *path, int level)
2007-01-26 23:51:26 +03:00
{
2016-06-23 01:54:23 +03:00
	struct btrfs_fs_info *fs_info = root->fs_info;
2007-10-16 00:14:19 +04:00
	struct extent_buffer *c;
	struct extent_buffer *split;
	struct btrfs_disk_key disk_key;
2007-01-26 23:51:26 +03:00
	int mid;
2007-02-22 19:39:13 +03:00
	int ret;
2007-03-12 19:01:18 +03:00
	u32 c_nritems;
2007-02-02 17:18:22 +03:00
2007-10-16 00:14:19 +04:00
	c = path->nodes[level];
2007-12-11 17:25:06 +03:00
	WARN_ON(btrfs_header_generation(c) != trans->transid);
2007-10-16 00:14:19 +04:00
	if (c == root->node) {
2013-03-20 17:49:48 +04:00
		/*
2013-04-13 17:19:53 +04:00
		 * trying to split the root, lets make a new one
		 *
2013-05-22 16:06:51 +04:00
		 * tree mod log: We don't log_removal old root in
2013-04-13 17:19:53 +04:00
		 * insert_new_root, because that root buffer will be kept as a
		 * normal node. We are going to log removal of half of the
2021-03-11 17:31:07 +03:00
		 * elements below with btrfs_tree_mod_log_eb_copy(). We're
		 * holding a tree lock on the buffer, which is why we cannot
		 * race with other tree_mod_log users.
2013-03-20 17:49:48 +04:00
		 */
2013-05-22 16:06:51 +04:00
		ret = insert_new_root(trans, root, path, level + 1);
2007-02-22 19:39:13 +03:00
		if (ret)
			return ret;
2009-05-14 03:12:15 +04:00
	} else {
2007-04-20 21:16:02 +04:00
		ret = push_nodes_for_insert(trans, root, path, level);
2007-10-16 00:14:19 +04:00
		c = path->nodes[level];
		if (!ret && btrfs_header_nritems(c) <
2016-06-23 01:54:23 +03:00
		    BTRFS_NODEPTRS_PER_BLOCK(fs_info) - 3)
2007-04-20 21:16:02 +04:00
			return 0;
2007-06-22 22:16:25 +04:00
		if (ret < 0)
			return ret;
2007-01-26 23:51:26 +03:00
	}
2007-04-20 21:16:02 +04:00
2007-10-16 00:14:19 +04:00
	c_nritems = btrfs_header_nritems(c);
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
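The reference counting rule for copy-on-write described above can be modeled with a few lines of C. This is a deliberately simplified sketch, not the btrfs implementation: `cow_block` and `cow_block_copy` are hypothetical names, reference counts live directly in the in-memory blocks, and back references, extent items and transactions are ignored.

```c
#include <assert.h>
#include <stdlib.h>

#define COW_MAX_CHILDREN 4	/* hypothetical fan-out for the illustration */

struct cow_block {
	int refs;		/* how many parents or roots point at this block */
	int freed;		/* set once the block has been released */
	int nr_children;
	struct cow_block *children[COW_MAX_CHILDREN];
};

/*
 * Copy-on-write a block and return the new copy.
 *
 * Non-shared block (refs == 1): the old block is freed at once and the
 * new block inherits its references, so the children are not touched.
 *
 * Shared block (refs > 1): every extent the new block points to gains
 * one reference, and the old block loses one.
 */
static struct cow_block *cow_block_copy(struct cow_block *old)
{
	struct cow_block *copy = malloc(sizeof(*copy));
	int i;

	if (!copy)
		return NULL;

	*copy = *old;
	copy->refs = 1;
	copy->freed = 0;

	if (old->refs == 1) {
		old->refs = 0;
		old->freed = 1;
	} else {
		for (i = 0; i < copy->nr_children; i++)
			copy->children[i]->refs++;
		old->refs--;
	}
	return copy;
}
```

In the non-shared case no child reference counts change, which is the saving the commit message attributes to avoiding the dead root walk.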
2009-06-10 18:45:14 +04:00
	mid = (c_nritems + 1) / 2;
	btrfs_node_key(c, &disk_key, mid);
2007-12-11 17:25:06 +03:00
2021-06-29 16:43:06 +03:00
	split = btrfs_alloc_tree_block(trans, root, 0, root->root_key.objectid,
				       &disk_key, level, c->start, 0,
				       BTRFS_NESTING_SPLIT);
2007-10-16 00:14:19 +04:00
	if (IS_ERR(split))
		return PTR_ERR(split);
2023-09-08 02:09:33 +03:00
	root_add_used_bytes(root);
2018-06-18 14:13:19 +03:00
	ASSERT(btrfs_header_level(c) == level);
2007-06-22 22:16:25 +04:00
2021-03-11 17:31:07 +03:00
	ret = btrfs_tree_mod_log_eb_copy(split, c, 0, mid, c_nritems - mid);
Btrfs: fix tree mod logging
While running the test btrfs/004 from xfstests in a loop, it failed
about 1 time out of 20 runs in my desktop. The failure happened in
the backref walking part of the test, and the test's error message was
like this:
btrfs/004 93s ... [failed, exit status 1] - output mismatch (see /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad)
--- tests/btrfs/004.out 2013-11-26 18:25:29.263333714 +0000
+++ /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad 2013-12-10 15:25:10.327518516 +0000
@@ -1,3 +1,8 @@
QA output created by 004
*** test backref walking
-*** done
+unexpected output from
+ /home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
+expected inum: 405, expected address: 454656, file: /home/fdmanana/btrfs-tests/scratch_1/snap1/p0/d6/d3d/d156/fce, got:
+
...
(Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad' to see the entire diff)
Ran: btrfs/004
Failures: btrfs/004
Failed 1 of 1 tests
But immediately after the test finished, the btrfs inspect-internal command
returned the expected output:
$ btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
inode 405 offset 454656 root 258
inode 405 offset 454656 root 5
It turned out this was because the btrfs_search_old_slot() calls performed
during backref walking (backref.c:__resolve_indirect_ref) were not finding
anything. The reason for this turned out to be that the tree mod logging
code was not logging some node multi-step operations atomically, therefore
btrfs_search_old_slot() callers iterated often over an incomplete tree that
wasn't fully consistent with any tree state from the past. Besides missing
items, this often (but not always) resulted in -EIO errors during old slot
searches, reported in dmesg like this:
[ 4299.933936] ------------[ cut here ]------------
[ 4299.933949] WARNING: CPU: 0 PID: 23190 at fs/btrfs/ctree.c:1343 btrfs_search_old_slot+0x57b/0xab0 [btrfs]()
[ 4299.933950] Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep rfcomm bluetooth parport_pc ppdev binfmt_misc joydev snd_hda_codec_h
[ 4299.933977] CPU: 0 PID: 23190 Comm: btrfs Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70
[ 4299.933978] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 4299.933979] 000000000000053f ffff8806f3fd98f8 ffffffff8176d284 0000000000000007
[ 4299.933982] 0000000000000000 ffff8806f3fd9938 ffffffff8104a81c ffff880659c64b70
[ 4299.933984] ffff880659c643d0 ffff8806599233d8 ffff880701e2e938 0000160000000000
[ 4299.933987] Call Trace:
[ 4299.933991] [<ffffffff8176d284>] dump_stack+0x55/0x76
[ 4299.933994] [<ffffffff8104a81c>] warn_slowpath_common+0x8c/0xc0
[ 4299.933997] [<ffffffff8104a86a>] warn_slowpath_null+0x1a/0x20
[ 4299.934003] [<ffffffffa065d3bb>] btrfs_search_old_slot+0x57b/0xab0 [btrfs]
[ 4299.934005] [<ffffffff81775f3b>] ? _raw_read_unlock+0x2b/0x50
[ 4299.934010] [<ffffffffa0655001>] ? __tree_mod_log_search+0x81/0xc0 [btrfs]
[ 4299.934019] [<ffffffffa06dd9b0>] __resolve_indirect_refs+0x130/0x5f0 [btrfs]
[ 4299.934027] [<ffffffffa06a21f1>] ? free_extent_buffer+0x61/0xc0 [btrfs]
[ 4299.934034] [<ffffffffa06de39c>] find_parent_nodes+0x1fc/0xe40 [btrfs]
[ 4299.934042] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934048] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934056] [<ffffffffa06df980>] iterate_extent_inodes+0xe0/0x250 [btrfs]
[ 4299.934058] [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50
[ 4299.934065] [<ffffffffa06dfb82>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
[ 4299.934071] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934078] [<ffffffffa06b7015>] btrfs_ioctl+0xf65/0x1f60 [btrfs]
[ 4299.934080] [<ffffffff811658b8>] ? handle_mm_fault+0x278/0xb00
[ 4299.934083] [<ffffffff81075563>] ? up_read+0x23/0x40
[ 4299.934085] [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0
[ 4299.934088] [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570
[ 4299.934090] [<ffffffff81776e23>] ? error_sti+0x5/0x6
[ 4299.934093] [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 4299.934096] [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13
[ 4299.934098] [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0
[ 4299.934100] [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934104] ---[ end trace 48f0cfc902491414 ]---
[ 4299.934378] btrfs bad fsid on block 0
The tree mod log operations that must be performed atomically (tree_mod_log_free_eb,
tree_mod_log_eb_copy, tree_mod_log_insert_root and tree_mod_log_insert_move) were
performed atomically before the following commit:
c8cc6341653721b54760480b0d0d9b5f09b46741
(Btrfs: stop using GFP_ATOMIC for the tree mod log allocations)
That change removed the atomicity of such operations. This patch restores the
atomicity while still not doing the GFP_ATOMIC allocations of tree_mod_elem
structures, so it has to do the allocations using GFP_NOFS before acquiring
the mod log lock.
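The pattern described above (allocate first, then take the lock and apply every step of the operation in one critical section) can be sketched in user space as follows. This is only an illustration under stated assumptions: struct mod_elem, struct mod_log and mod_log_insert_batch() are hypothetical names, malloc() stands in for a sleeping GFP_NOFS allocation, and a pthread mutex stands in for the mod log lock. It is not the kernel implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a tree mod log element. */
struct mod_elem {
	int slot;
	struct mod_elem *next;
};

/* Hypothetical log: a singly linked list guarded by one lock. */
struct mod_log {
	pthread_mutex_t lock;
	struct mod_elem *head;
	int count;
};

static void mod_log_init(struct mod_log *log)
{
	pthread_mutex_init(&log->lock, NULL);
	log->head = NULL;
	log->count = 0;
}

/*
 * Log nr entries as one atomic step. Every allocation happens before the
 * lock is taken, because allocating may block. Under the lock we only link
 * the preallocated elements, so a reader holding the lock sees either none
 * of the entries or all of them.
 * Returns 0 on success, -1 if an allocation failed (nothing is logged).
 */
static int mod_log_insert_batch(struct mod_log *log, int first_slot, int nr)
{
	struct mod_elem **elems;
	int i;

	assert(nr > 0);
	elems = calloc(nr, sizeof(*elems));
	if (!elems)
		return -1;
	for (i = 0; i < nr; i++) {
		elems[i] = malloc(sizeof(**elems));
		if (!elems[i])
			goto fail;
		elems[i]->slot = first_slot + i;
	}

	pthread_mutex_lock(&log->lock);
	for (i = 0; i < nr; i++) {
		elems[i]->next = log->head;
		log->head = elems[i];
		log->count++;
	}
	pthread_mutex_unlock(&log->lock);

	free(elems);
	return 0;
fail:
	/* Free only the elements allocated so far. */
	while (i-- > 0)
		free(elems[i]);
	free(elems);
	return -1;
}
```

The design choice shown is that the failure path runs entirely before the lock is taken, so a failed allocation never leaves a half-logged operation behind.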
This issue has been experienced by several users recently, for example:
http://www.spinics.net/lists/linux-btrfs/msg28574.html
After running the btrfs/004 test for 679 consecutive iterations with this
patch applied, I didn't run into the issue anymore.
Cc: stable@vger.kernel.org
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2013-12-20 19:17:46 +04:00
if ( ret ) {
2023-06-08 13:27:38 +03:00
btrfs_tree_unlock ( split ) ;
free_extent_buffer ( split ) ;
2016-06-11 01:19:25 +03:00
btrfs_abort_transaction ( trans , ret ) ;
2013-12-20 19:17:46 +04:00
return ret ;
}
2007-10-16 00:14:19 +04:00
copy_extent_buffer ( split , c ,
2022-11-15 19:16:16 +03:00
btrfs_node_key_ptr_offset ( split , 0 ) ,
btrfs_node_key_ptr_offset ( c , mid ) ,
2007-10-16 00:14:19 +04:00
( c_nritems - mid ) * sizeof ( struct btrfs_key_ptr ) ) ;
btrfs_set_header_nritems ( split , c_nritems - mid ) ;
btrfs_set_header_nritems ( c , mid ) ;
2007-03-01 00:35:06 +03:00
2007-10-16 00:14:19 +04:00
btrfs_mark_buffer_dirty ( c ) ;
btrfs_mark_buffer_dirty ( split ) ;
2023-06-08 13:27:48 +03:00
ret = insert_ptr ( trans , path , & disk_key , split - > start ,
path - > slots [ level + 1 ] + 1 , level + 1 ) ;
if ( ret < 0 ) {
btrfs_tree_unlock ( split ) ;
free_extent_buffer ( split ) ;
return ret ;
}
2007-03-01 00:35:06 +03:00
2007-02-24 14:24:44 +03:00
if ( path - > slots [ level ] > = mid ) {
2007-02-22 19:39:13 +03:00
path - > slots [ level ] - = mid ;
2008-06-26 00:01:30 +04:00
btrfs_tree_unlock ( c ) ;
2007-10-16 00:14:19 +04:00
free_extent_buffer ( c ) ;
path - > nodes [ level ] = split ;
2007-02-22 19:39:13 +03:00
path - > slots [ level + 1 ] + = 1 ;
} else {
2008-06-26 00:01:30 +04:00
btrfs_tree_unlock ( split ) ;
2007-10-16 00:14:19 +04:00
free_extent_buffer ( split ) ;
2007-01-26 23:51:26 +03:00
}
2020-11-12 14:24:02 +03:00
return 0 ;
2007-01-26 23:51:26 +03:00
}
2007-02-02 19:05:29 +03:00
/*
* How many bytes are required to store the items in a leaf. start
* and nr indicate which items in the leaf to check. This totals up the
* space used both by the item structs and the item data.
*/
2023-04-27 15:16:27 +03:00
static int leaf_space_used ( const struct extent_buffer * l , int start , int nr )
2007-01-26 23:51:26 +03:00
{
int data_len ;
2007-10-16 00:14:19 +04:00
int nritems = btrfs_header_nritems ( l ) ;
2007-04-04 22:08:15 +04:00
int end = min ( nritems , start + nr ) - 1 ;
2007-01-26 23:51:26 +03:00
if ( ! nr )
return 0 ;
2021-10-21 21:58:35 +03:00
data_len = btrfs_item_offset ( l , start ) + btrfs_item_size ( l , start ) ;
data_len = data_len - btrfs_item_offset ( l , end ) ;
2007-03-13 03:12:07 +03:00
data_len + = sizeof ( struct btrfs_item ) * nr ;
2007-04-04 22:08:15 +04:00
WARN_ON ( data_len < 0 ) ;
2007-01-26 23:51:26 +03:00
return data_len ;
}
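The accounting done by leaf_space_used() and btrfs_leaf_free_space() can be tried out on a toy model. The sketch below assumes a leaf whose item data grows backwards from the end of the data area, so item 0 has the highest offset. struct toy_item, ITEM_HEADER_SIZE and LEAF_DATA_SIZE are hypothetical stand-ins chosen for the example, not the kernel's types or values.

```c
#include <assert.h>
#include <stdint.h>

#define ITEM_HEADER_SIZE 25   /* hypothetical per-item header size */
#define LEAF_DATA_SIZE   1000 /* hypothetical size of the leaf data area */

/* Hypothetical item header: where the item's data starts and how long it is. */
struct toy_item {
	uint32_t offset;
	uint32_t size;
};

/*
 * Bytes used by items [start, start + nr): the span of data from the end of
 * the first item's data down to the start of the last item's data, plus one
 * header per item.
 */
static int toy_leaf_space_used(const struct toy_item *items, int nritems,
			       int start, int nr)
{
	int end = (nritems < start + nr ? nritems : start + nr) - 1;
	int data_len;

	if (!nr)
		return 0;
	data_len = items[start].offset + items[start].size;
	data_len -= items[end].offset;
	data_len += ITEM_HEADER_SIZE * nr;
	return data_len;
}

/* Room left for both item headers and item data. */
static int toy_leaf_free_space(const struct toy_item *items, int nritems)
{
	return LEAF_DATA_SIZE - toy_leaf_space_used(items, nritems, 0, nritems);
}
```

For three items of 100, 50 and 200 bytes packed at offsets 900, 850 and 650, the data spans 350 bytes and the headers take 75, so 425 bytes are used and 575 remain.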
2007-04-04 22:08:15 +04:00
/*
* The space between the end of the leaf items and
* the start of the leaf data. IOW, how much room
* the leaf has left for both items and data.
*/
2023-04-27 15:16:27 +03:00
int btrfs_leaf_free_space ( const struct extent_buffer * leaf )
2007-04-04 22:08:15 +04:00
{
2019-03-20 16:36:46 +03:00
struct btrfs_fs_info * fs_info = leaf - > fs_info ;
2007-10-16 00:14:19 +04:00
int nritems = btrfs_header_nritems ( leaf ) ;
int ret ;
2016-06-23 01:54:23 +03:00
ret = BTRFS_LEAF_DATA_SIZE ( fs_info ) - leaf_space_used ( leaf , 0 , nritems ) ;
2007-10-16 00:14:19 +04:00
if ( ret < 0 ) {
2016-06-23 01:54:23 +03:00
btrfs_crit ( fs_info ,
" leaf free space ret %d, leaf data size %lu, used %d nritems %d " ,
ret ,
( unsigned long ) BTRFS_LEAF_DATA_SIZE ( fs_info ) ,
leaf_space_used ( leaf , 0 , nritems ) , nritems ) ;
2007-10-16 00:14:19 +04:00
}
return ret ;
2007-04-04 22:08:15 +04:00
}
2010-07-07 18:51:48 +04:00
/*
* min_slot controls the lowest index we're willing to push to the
* right. We'll push up to and including min_slot, but no lower
*/
2023-01-27 00:00:55 +03:00
static noinline int __push_leaf_right ( struct btrfs_trans_handle * trans ,
struct btrfs_path * path ,
2009-03-13 17:04:31 +03:00
int data_size , int empty ,
struct extent_buffer * right ,
2010-07-07 18:51:48 +04:00
int free_space , u32 left_nritems ,
u32 min_slot )
2007-02-24 20:47:20 +03:00
{
2019-03-20 16:39:45 +03:00
struct btrfs_fs_info * fs_info = right - > fs_info ;
2007-10-16 00:14:19 +04:00
struct extent_buffer * left = path - > nodes [ 0 ] ;
2009-03-13 17:04:31 +03:00
struct extent_buffer * upper = path - > nodes [ 1 ] ;
2012-03-03 16:40:03 +04:00
struct btrfs_map_token token ;
2007-10-16 00:14:19 +04:00
struct btrfs_disk_key disk_key ;
2007-02-24 20:47:20 +03:00
int slot ;
2007-11-07 21:31:03 +03:00
u32 i ;
2007-02-24 20:47:20 +03:00
int push_space = 0 ;
int push_items = 0 ;
2007-11-07 21:31:03 +03:00
u32 nr ;
2007-03-12 19:01:18 +03:00
u32 right_nritems ;
2007-10-16 00:14:19 +04:00
u32 data_end ;
2007-10-16 00:15:53 +04:00
u32 this_item_size ;
2007-02-24 20:47:20 +03:00
2007-11-07 21:31:03 +03:00
if ( empty )
nr = 0 ;
else
2010-07-07 18:51:48 +04:00
nr = max_t ( u32 , 1 , min_slot ) ;
2007-11-07 21:31:03 +03:00
2008-09-23 21:14:14 +04:00
if ( path - > slots [ 0 ] > = left_nritems )
2008-12-17 18:21:48 +03:00
push_space + = data_size ;
2008-09-23 21:14:14 +04:00
2009-03-13 17:04:31 +03:00
slot = path - > slots [ 1 ] ;
2007-11-07 21:31:03 +03:00
i = left_nritems - 1 ;
while ( i > = nr ) {
2008-09-23 21:14:14 +04:00
if ( ! empty & & push_items > 0 ) {
if ( path - > slots [ 0 ] > i )
break ;
if ( path - > slots [ 0 ] = = i ) {
2019-03-20 16:36:46 +03:00
int space = btrfs_leaf_free_space ( left ) ;
2008-09-23 21:14:14 +04:00
if ( space + push_space * 2 > free_space )
break ;
}
}
2007-02-24 20:47:20 +03:00
if ( path - > slots [ 0 ] = = i )
2008-12-17 18:21:48 +03:00
push_space + = data_size ;
2007-10-16 00:15:53 +04:00
2021-10-21 21:58:35 +03:00
this_item_size = btrfs_item_size ( left , i ) ;
2021-10-21 21:58:34 +03:00
if ( this_item_size + sizeof ( struct btrfs_item ) +
push_space > free_space )
2007-02-24 20:47:20 +03:00
break ;
2008-09-23 21:14:14 +04:00
2007-02-24 20:47:20 +03:00
push_items + + ;
2021-10-21 21:58:34 +03:00
push_space + = this_item_size + sizeof ( struct btrfs_item ) ;
2007-11-07 21:31:03 +03:00
if ( i = = 0 )
break ;
i - - ;
2007-10-16 00:15:53 +04:00
}
2007-10-16 00:14:19 +04:00
2008-06-26 00:01:30 +04:00
if ( push_items = = 0 )
goto out_unlock ;
2007-10-16 00:14:19 +04:00
2012-11-04 00:30:18 +04:00
WARN_ON ( ! empty & & push_items = = left_nritems ) ;
2007-10-16 00:14:19 +04:00
2007-02-24 20:47:20 +03:00
/* push left to right */
2007-10-16 00:14:19 +04:00
right_nritems = btrfs_header_nritems ( right ) ;
2007-11-07 21:31:03 +03:00
2021-10-21 21:58:37 +03:00
push_space = btrfs_item_data_end ( left , left_nritems - push_items ) ;
2019-03-20 13:33:10 +03:00
push_space - = leaf_data_end ( left ) ;
2007-10-16 00:14:19 +04:00
2007-02-24 20:47:20 +03:00
/* make room in the right data area */
2019-03-20 13:33:10 +03:00
data_end = leaf_data_end ( right ) ;
2022-11-15 19:16:17 +03:00
memmove_leaf_data ( right , data_end - push_space , data_end ,
BTRFS_LEAF_DATA_SIZE ( fs_info ) - data_end ) ;
2007-10-16 00:14:19 +04:00
2007-02-24 20:47:20 +03:00
/* copy from the left data area */
2022-11-15 19:16:17 +03:00
copy_leaf_data ( right , left , BTRFS_LEAF_DATA_SIZE ( fs_info ) - push_space ,
leaf_data_end ( left ) , push_space ) ;
2007-10-16 00:14:19 +04:00
2022-11-15 19:16:17 +03:00
memmove_leaf_items ( right , push_items , 0 , right_nritems ) ;
2007-10-16 00:14:19 +04:00
2007-02-24 20:47:20 +03:00
/* copy the items from left to right */
2022-11-15 19:16:17 +03:00
copy_leaf_items ( right , left , 0 , left_nritems - push_items , push_items ) ;
2007-02-24 20:47:20 +03:00
/* update the item pointers */
2019-08-09 18:48:21 +03:00
btrfs_init_map_token ( & token , right ) ;
2007-03-12 19:01:18 +03:00
right_nritems + = push_items ;
2007-10-16 00:14:19 +04:00
btrfs_set_header_nritems ( right , right_nritems ) ;
2016-06-23 01:54:23 +03:00
push_space = BTRFS_LEAF_DATA_SIZE ( fs_info ) ;
2007-03-12 19:01:18 +03:00
for ( i = 0 ; i < right_nritems ; i + + ) {
2021-10-21 21:58:35 +03:00
push_space - = btrfs_token_item_size ( & token , i ) ;
btrfs_set_token_item_offset ( & token , i , push_space ) ;
2007-10-16 00:15:53 +04:00
}
2007-03-12 19:01:18 +03:00
left_nritems - = push_items ;
2007-10-16 00:14:19 +04:00
btrfs_set_header_nritems ( left , left_nritems ) ;
2007-02-24 20:47:20 +03:00
2007-11-07 21:31:03 +03:00
if ( left_nritems )
btrfs_mark_buffer_dirty ( left ) ;
2010-05-16 18:46:25 +04:00
else
2023-01-27 00:00:58 +03:00
btrfs_clear_buffer_dirty ( trans , left ) ;
2010-05-16 18:46:25 +04:00
2007-10-16 00:14:19 +04:00
btrfs_mark_buffer_dirty ( right ) ;
2007-04-19 00:15:28 +04:00
2007-10-16 00:14:19 +04:00
btrfs_item_key ( right , & disk_key , 0 ) ;
btrfs_set_node_key ( upper , & disk_key , slot + 1 ) ;
2007-03-30 22:27:56 +04:00
btrfs_mark_buffer_dirty ( upper ) ;
2007-03-03 00:08:05 +03:00
2007-02-24 20:47:20 +03:00
/* then fixup the leaf pointer in the path */
2007-03-12 19:01:18 +03:00
if ( path - > slots [ 0 ] > = left_nritems ) {
path - > slots [ 0 ] - = left_nritems ;
2008-06-26 00:01:30 +04:00
if ( btrfs_header_nritems ( path - > nodes [ 0 ] ) = = 0 )
2023-01-27 00:00:58 +03:00
btrfs_clear_buffer_dirty ( trans , path - > nodes [ 0 ] ) ;
2008-06-26 00:01:30 +04:00
btrfs_tree_unlock ( path - > nodes [ 0 ] ) ;
2007-10-16 00:14:19 +04:00
free_extent_buffer ( path - > nodes [ 0 ] ) ;
path - > nodes [ 0 ] = right ;
2007-02-24 20:47:20 +03:00
path - > slots [ 1 ] + = 1 ;
} else {
2008-06-26 00:01:30 +04:00
btrfs_tree_unlock ( right ) ;
2007-10-16 00:14:19 +04:00
free_extent_buffer ( right ) ;
2007-02-24 20:47:20 +03:00
}
return 0 ;
2008-06-26 00:01:30 +04:00
out_unlock :
btrfs_tree_unlock ( right ) ;
free_extent_buffer ( right ) ;
return 1 ;
2007-02-24 20:47:20 +03:00
}
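The item-selection loop at the top of __push_leaf_right() can be reduced to the following sketch. It is a simplification under stated assumptions: it ignores the path slot handling and the data_size reservation, and only shows how items are taken from the end of the left leaf while they still fit into the right leaf's free space. toy_count_push_right() and ITEM_HEADER_SIZE are hypothetical names for this example.

```c
#include <assert.h>
#include <stdint.h>

#define ITEM_HEADER_SIZE 25 /* hypothetical per-item header size */

/*
 * Count how many items can move from the end of the left leaf into a right
 * leaf with free_space bytes available. Unless the left leaf is being
 * emptied, slot 0 always stays behind, and no slot below min_slot moves.
 */
static uint32_t toy_count_push_right(const uint32_t *item_sizes,
				     uint32_t nritems, uint32_t min_slot,
				     int free_space, int empty)
{
	uint32_t nr = empty ? 0 : (min_slot > 1 ? min_slot : 1);
	uint32_t push_items = 0;
	int push_space = 0;
	uint32_t i;

	if (nritems == 0)
		return 0;

	i = nritems - 1;
	while (i >= nr) {
		int need = (int)item_sizes[i] + ITEM_HEADER_SIZE;

		/* Stop at the first item that no longer fits. */
		if (need + push_space > free_space)
			break;
		push_items++;
		push_space += need;
		/* i is unsigned, so leave the loop before it would wrap. */
		if (i == 0)
			break;
		i--;
	}
	return push_items;
}
```

The explicit i == 0 check mirrors the original loop: because i is a u32, decrementing past zero would wrap around instead of ending the loop.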
2008-06-26 00:01:30 +04:00
2009-03-13 17:04:31 +03:00
/*
* push some data in the path leaf to the right, trying to free up at
* least data_size bytes. returns zero if the push worked, nonzero otherwise
*
* returns 1 if the push failed because the other node didn't have enough
* room, 0 if everything worked out and < 0 if there were major errors.
2010-07-07 18:51:48 +04:00
*
* this will push starting from min_slot to the end of the leaf. It won't
* push any slot lower than min_slot
2009-03-13 17:04:31 +03:00
*/
static int push_leaf_right ( struct btrfs_trans_handle * trans , struct btrfs_root
2010-07-07 18:51:48 +04:00
* root , struct btrfs_path * path ,
int min_data_size , int data_size ,
int empty , u32 min_slot )
2009-03-13 17:04:31 +03:00
{
struct extent_buffer * left = path - > nodes [ 0 ] ;
struct extent_buffer * right ;
struct extent_buffer * upper ;
int slot ;
int free_space ;
u32 left_nritems ;
int ret ;
if ( ! path - > nodes [ 1 ] )
return 1 ;
slot = path - > slots [ 1 ] ;
upper = path - > nodes [ 1 ] ;
if ( slot > = btrfs_header_nritems ( upper ) - 1 )
return 1 ;
2021-09-22 12:36:45 +03:00
btrfs_assert_tree_write_locked ( path - > nodes [ 1 ] ) ;
2009-03-13 17:04:31 +03:00
2019-08-21 20:16:27 +03:00
right = btrfs_read_node_slot ( upper , slot + 1 ) ;
2016-07-05 22:10:14 +03:00
if ( IS_ERR ( right ) )
2023-02-07 19:57:21 +03:00
return PTR_ERR ( right ) ;
2011-01-05 05:32:22 +03:00
2020-08-20 18:46:04 +03:00
__btrfs_tree_lock ( right , BTRFS_NESTING_RIGHT ) ;
2009-03-13 17:04:31 +03:00
2019-03-20 16:36:46 +03:00
free_space = btrfs_leaf_free_space ( right ) ;
2009-03-13 17:04:31 +03:00
if ( free_space < data_size )
goto out_unlock ;
ret = btrfs_cow_block ( trans , root , right , upper ,
2020-08-20 18:46:05 +03:00
slot + 1 , & right , BTRFS_NESTING_RIGHT_COW ) ;
2009-03-13 17:04:31 +03:00
if ( ret )
goto out_unlock ;
left_nritems = btrfs_header_nritems ( left ) ;
if ( left_nritems = = 0 )
goto out_unlock ;
btrfs: ctree: check key order before merging tree blocks
[BUG]
With a crafted image, btrfs can panic at btrfs_del_csums():
kernel BUG at fs/btrfs/ctree.c:3188!
invalid opcode: 0000 [#1] SMP PTI
CPU: 0 PID: 1156 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9
RIP: 0010:btrfs_set_item_key_safe+0x16c/0x180
RSP: 0018:ffff976141257ab8 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff898a6b890930 RCX: 0000000004b70000
RDX: 0000000000000000 RSI: ffff976141257bae RDI: ffff976141257acf
RBP: ffff976141257b10 R08: 0000000000001000 R09: ffff9761412579a8
R10: 0000000000000000 R11: 0000000000000000 R12: ffff976141257abe
R13: 0000000000000003 R14: ffff898a6a8be578 R15: ffff976141257bae
FS: 0000000000000000(0000) GS:ffff898a77a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f779d9cd624 CR3: 000000022b2b4006 CR4: 00000000000206f0
Call Trace:
truncate_one_csum+0xac/0xf0
btrfs_del_csums+0x24f/0x3a0
__btrfs_free_extent.isra.72+0x5a7/0xbe0
__btrfs_run_delayed_refs+0x539/0x1120
btrfs_run_delayed_refs+0xdb/0x1b0
btrfs_commit_transaction+0x52/0x950
? start_transaction+0x94/0x450
transaction_kthread+0x163/0x190
kthread+0x105/0x140
? btrfs_cleanup_transaction+0x560/0x560
? kthread_destroy_worker+0x50/0x50
ret_from_fork+0x35/0x40
Modules linked in:
---[ end trace 93bf9db00e6c374e ]---
[CAUSE]
This crafted image has a tricky key order corruption:
checksum tree key (CSUM_TREE ROOT_ITEM 0)
node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
...
key (EXTENT_CSUM EXTENT_CSUM 73785344) block 29757440 gen 19
key (EXTENT_CSUM EXTENT_CSUM 77594624) block 29753344 gen 19
...
leaf 29757440 items 5 free space 150 generation 19 owner CSUM_TREE
item 0 key (EXTENT_CSUM EXTENT_CSUM 73785344) itemoff 2323 itemsize 1672
range start 73785344 end 75497472 length 1712128
item 1 key (EXTENT_CSUM EXTENT_CSUM 75497472) itemoff 2319 itemsize 4
range start 75497472 end 75501568 length 4096
item 2 key (EXTENT_CSUM EXTENT_CSUM 75501568) itemoff 579 itemsize 1740
range start 75501568 end 77283328 length 1781760
item 3 key (EXTENT_CSUM EXTENT_CSUM 77283328) itemoff 575 itemsize 4
range start 77283328 end 77287424 length 4096
item 4 key (EXTENT_CSUM EXTENT_CSUM 4120596480) itemoff 275 itemsize 300 <<<
range start 4120596480 end 4120903680 length 307200
leaf 29753344 items 3 free space 1936 generation 19 owner CSUM_TREE
item 0 key (18446744073457893366 EXTENT_CSUM 77594624) itemoff 2323 itemsize 1672
range start 77594624 end 79306752 length 1712128
...
Note the item 4 key of leaf 29757440, which is obviously too large, and
even larger than the first key of the next leaf.
However, it still follows the key order within that tree block, so the tree
checker is unable to detect it at read time: the tree checker only works
inside one leaf, so this kind of cross-leaf corruption can't be detected in
advance.
[FIX]
The next opportunity to detect such a problem is at tree block merge time,
which is in push_node_left(), balance_node_right(), push_leaf_left() or
push_leaf_right().
Now we check whether the right-most key of the left node is larger than the
left-most key of the right node.
This avoids calling the full tree-checker while still keeping the key order
correct: key order within each node is already verified by the tree checker,
so only the above two slots need to be checked.
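The two-slot check described in [FIX] can be sketched as follows. This is a minimal illustration, not the kernel's check_sibling_keys(): struct toy_key and both helpers are hypothetical, keys are compared by (objectid, type, offset) in that order, and treating equal keys as a violation is an assumption of this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CPU-order key: compared by objectid, then type, then offset. */
struct toy_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Returns <0, 0 or >0 like memcmp(). */
static int toy_comp_keys(const struct toy_key *a, const struct toy_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Check only the boundary between two sibling blocks: the last key of the
 * left block must sort strictly before the first key of the right block.
 * Returns 1 if the order is violated, 0 otherwise. Empty blocks are treated
 * as trivially ordered (an assumption of this sketch).
 */
static int toy_check_sibling_keys(const struct toy_key *left, int left_n,
				  const struct toy_key *right, int right_n)
{
	if (left_n == 0 || right_n == 0)
		return 0;
	return toy_comp_keys(&left[left_n - 1], &right[0]) >= 0;
}
```

With the offsets from the dump above, a left leaf ending at offset 4120596480 followed by a right leaf starting at offset 77594624 is flagged, even though each leaf is sorted internally.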
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202833
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-19 09:35:50 +03:00
if ( check_sibling_keys ( left , right ) ) {
ret = - EUCLEAN ;
2023-04-26 13:51:35 +03:00
btrfs_abort_transaction ( trans , ret ) ;
btrfs_tree_unlock ( right ) ;
free_extent_buffer ( right ) ;
return ret ;
}
2013-12-05 02:17:39 +04:00
if ( path - > slots [ 0 ] = = left_nritems & & ! empty ) {
/* Key greater than all keys in the leaf, right neighbor has
* enough room for it and we're not emptying our leaf to delete
* it, therefore use right neighbor to insert the new item and
2018-11-28 14:05:13 +03:00
* no need to touch/dirty our left leaf. */
2013-12-05 02:17:39 +04:00
btrfs_tree_unlock ( left ) ;
free_extent_buffer ( left ) ;
path - > nodes [ 0 ] = right ;
path - > slots [ 0 ] = 0 ;
path - > slots [ 1 ] + + ;
return 0 ;
}
2023-01-27 00:00:55 +03:00
return __push_leaf_right ( trans , path , min_data_size , empty , right ,
free_space , left_nritems , min_slot ) ;
2009-03-13 17:04:31 +03:00
out_unlock :
btrfs_tree_unlock ( right ) ;
free_extent_buffer ( right ) ;
return 1 ;
}
2007-02-02 19:05:29 +03:00
/*
* push some data in the path leaf to the left, trying to free up at
* least data_size bytes. returns zero if the push worked, nonzero otherwise
2010-07-07 18:51:48 +04:00
*
* max_slot can put a limit on how far into the leaf we'll push items. The
* item at 'max_slot' won't be touched. Use (u32)-1 to make us do all the
* items
2007-02-02 19:05:29 +03:00
*/
2023-01-27 00:00:55 +03:00
static noinline int __push_leaf_left ( struct btrfs_trans_handle * trans ,
struct btrfs_path * path , int data_size ,
2009-03-13 17:04:31 +03:00
int empty , struct extent_buffer * left ,
2010-07-07 18:51:48 +04:00
int free_space , u32 right_nritems ,
u32 max_slot )
2007-01-26 23:51:26 +03:00
{
2019-03-20 16:40:41 +03:00
struct btrfs_fs_info * fs_info = left - > fs_info ;
2007-10-16 00:14:19 +04:00
struct btrfs_disk_key disk_key ;
struct extent_buffer * right = path - > nodes [ 0 ] ;
2007-01-26 23:51:26 +03:00
int i ;
int push_space = 0 ;
int push_items = 0 ;
2007-03-12 19:01:18 +03:00
u32 old_left_nritems ;
2007-11-07 21:31:03 +03:00
u32 nr ;
2007-03-01 00:35:06 +03:00
int ret = 0 ;
2007-10-16 00:15:53 +04:00
u32 this_item_size ;
u32 old_left_item_size ;
2012-03-03 16:40:03 +04:00
struct btrfs_map_token token ;
2007-11-07 21:31:03 +03:00
if ( empty )
2010-07-07 18:51:48 +04:00
nr = min ( right_nritems , max_slot ) ;
2007-11-07 21:31:03 +03:00
else
2010-07-07 18:51:48 +04:00
nr = min ( right_nritems - 1 , max_slot ) ;
2007-11-07 21:31:03 +03:00
for ( i = 0 ; i < nr ; i + + ) {
2008-09-23 21:14:14 +04:00
if ( ! empty & & push_items > 0 ) {
if ( path - > slots [ 0 ] < i )
break ;
if ( path - > slots [ 0 ] = = i ) {
2019-03-20 16:36:46 +03:00
int space = btrfs_leaf_free_space ( right ) ;
2008-09-23 21:14:14 +04:00
if ( space + push_space * 2 > free_space )
break ;
}
}
2007-01-26 23:51:26 +03:00
if ( path - > slots [ 0 ] = = i )
2008-12-17 18:21:48 +03:00
push_space + = data_size ;
2007-10-16 00:15:53 +04:00
2021-10-21 21:58:35 +03:00
this_item_size = btrfs_item_size ( right , i ) ;
2021-10-21 21:58:34 +03:00
if ( this_item_size + sizeof ( struct btrfs_item ) + push_space >
free_space )
2007-01-26 23:51:26 +03:00
break ;
2007-10-16 00:15:53 +04:00
2007-01-26 23:51:26 +03:00
push_items + + ;
2021-10-21 21:58:34 +03:00
push_space + = this_item_size + sizeof ( struct btrfs_item ) ;
2007-10-16 00:15:53 +04:00
}
2007-01-26 23:51:26 +03:00
if ( push_items = = 0 ) {
2008-06-26 00:01:30 +04:00
ret = 1 ;
goto out ;
2007-01-26 23:51:26 +03:00
}
2013-10-31 09:00:08 +04:00
WARN_ON ( ! empty & & push_items = = btrfs_header_nritems ( right ) ) ;
2007-10-16 00:14:19 +04:00
2007-01-26 23:51:26 +03:00
/* push data from right to left */
2022-11-15 19:16:17 +03:00
copy_leaf_items ( left , right , btrfs_header_nritems ( left ) , 0 , push_items ) ;
2007-10-16 00:14:19 +04:00
2016-06-23 01:54:23 +03:00
push_space = BTRFS_LEAF_DATA_SIZE ( fs_info ) -
2021-10-21 21:58:35 +03:00
btrfs_item_offset ( right , push_items - 1 ) ;
2007-10-16 00:14:19 +04:00
2022-11-15 19:16:17 +03:00
copy_leaf_data ( left , right , leaf_data_end ( left ) - push_space ,
btrfs_item_offset ( right , push_items - 1 ) , push_space ) ;
2007-10-16 00:14:19 +04:00
old_left_nritems = btrfs_header_nritems ( left ) ;
2008-12-17 18:21:48 +03:00
BUG_ON ( old_left_nritems < = 0 ) ;
2007-02-02 17:18:22 +03:00
2019-08-09 18:48:21 +03:00
btrfs_init_map_token ( & token , left ) ;
2021-10-21 21:58:35 +03:00
old_left_item_size = btrfs_item_offset ( left , old_left_nritems - 1 ) ;
2007-03-13 03:12:07 +03:00
for ( i = old_left_nritems ; i < old_left_nritems + push_items ; i + + ) {
2007-10-16 00:14:19 +04:00
u32 ioff ;
2007-10-16 00:15:53 +04:00
2021-10-21 21:58:35 +03:00
ioff = btrfs_token_item_offset ( & token , i ) ;
btrfs_set_token_item_offset ( & token , i ,
2020-04-29 03:15:56 +03:00
ioff - ( BTRFS_LEAF_DATA_SIZE ( fs_info ) - old_left_item_size ) ) ;
2007-01-26 23:51:26 +03:00
}
2007-10-16 00:14:19 +04:00
btrfs_set_header_nritems ( left , old_left_nritems + push_items ) ;
2007-01-26 23:51:26 +03:00
/* fixup right node */
2012-11-03 14:58:34 +04:00
if ( push_items > right_nritems )
WARN ( 1 , KERN_CRIT " push items %d nr %u \n " , push_items ,
2009-01-06 05:25:51 +03:00
right_nritems ) ;
2007-11-07 21:31:03 +03:00
if ( push_items < right_nritems ) {
2021-10-21 21:58:35 +03:00
push_space = btrfs_item_offset ( right , push_items - 1 ) -
2019-03-20 13:33:10 +03:00
leaf_data_end ( right ) ;
2022-11-15 19:16:17 +03:00
memmove_leaf_data ( right ,
BTRFS_LEAF_DATA_SIZE ( fs_info ) - push_space ,
leaf_data_end ( right ) , push_space ) ;
memmove_leaf_items ( right , 0 , push_items ,
btrfs_header_nritems ( right ) - push_items ) ;
2007-11-07 21:31:03 +03:00
}
2019-08-09 18:48:21 +03:00
btrfs_init_map_token ( & token , right ) ;
2007-11-26 18:58:13 +03:00
right_nritems - = push_items ;
btrfs_set_header_nritems ( right , right_nritems ) ;
2016-06-23 01:54:23 +03:00
push_space = BTRFS_LEAF_DATA_SIZE ( fs_info ) ;
2007-10-16 00:14:19 +04:00
for ( i = 0 ; i < right_nritems ; i + + ) {
2021-10-21 21:58:35 +03:00
push_space = push_space - btrfs_token_item_size ( & token , i ) ;
btrfs_set_token_item_offset ( & token , i , push_space ) ;
2007-10-16 00:15:53 +04:00
}
2007-02-02 17:18:22 +03:00
2007-10-16 00:14:19 +04:00
btrfs_mark_buffer_dirty ( left ) ;
2007-11-07 21:31:03 +03:00
if ( right_nritems )
btrfs_mark_buffer_dirty ( right ) ;
2010-05-16 18:46:25 +04:00
else
2023-01-27 00:00:58 +03:00
btrfs_clear_buffer_dirty ( trans , right ) ;
2007-05-11 19:33:21 +04:00
2007-10-16 00:14:19 +04:00
btrfs_item_key ( right , & disk_key , 0 ) ;
2018-06-20 15:48:47 +03:00
fixup_low_keys ( path , & disk_key , 1 ) ;
2007-01-26 23:51:26 +03:00
/* then fixup the leaf pointer in the path */
if ( path - > slots [ 0 ] < push_items ) {
path - > slots [ 0 ] + = old_left_nritems ;
2008-06-26 00:01:30 +04:00
btrfs_tree_unlock ( path - > nodes [ 0 ] ) ;
2007-10-16 00:14:19 +04:00
free_extent_buffer ( path - > nodes [ 0 ] ) ;
path - > nodes [ 0 ] = left ;
2007-01-26 23:51:26 +03:00
path - > slots [ 1 ] - = 1 ;
} else {
2008-06-26 00:01:30 +04:00
btrfs_tree_unlock ( left ) ;
2007-10-16 00:14:19 +04:00
free_extent_buffer ( left ) ;
2007-01-26 23:51:26 +03:00
path - > slots [ 0 ] - = push_items ;
}
2007-02-02 17:18:22 +03:00
BUG_ON ( path - > slots [ 0 ] < 0 ) ;
2007-03-01 00:35:06 +03:00
return ret ;
2008-06-26 00:01:30 +04:00
out :
btrfs_tree_unlock ( left ) ;
free_extent_buffer ( left ) ;
return ret ;
2007-01-26 23:51:26 +03:00
}
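The tail of the function above rebuilds the data offsets of the items that remain in the right leaf: it starts from the end of the leaf data area and walks the items in slot order, placing each item's data directly below the previous one. A minimal user-space sketch of that arithmetic follows. The name `demo_repack_offsets` and the plain arrays are hypothetical stand-ins for illustration, not the kernel's extent buffer accessors.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: recompute item data offsets so that item 0 ends at
 * leaf_data_size and every later item sits directly below its predecessor.
 * sizes[] and offsets[] stand in for the per-item size and offset fields.
 */
static void demo_repack_offsets(const uint32_t *sizes, uint32_t *offsets,
				int nritems, uint32_t leaf_data_size)
{
	uint32_t end = leaf_data_size;
	int i;

	for (i = 0; i < nritems; i++) {
		/* An item can never be larger than the space left above it. */
		assert(sizes[i] <= end);
		end -= sizes[i];
		offsets[i] = end;
	}
}
```

Item data grows downward from the end of the leaf while item headers grow upward from the start, which is why the running value only ever decreases.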
2009-03-13 17:04:31 +03:00
/*
* push some data in the path leaf to the left , trying to free up at
* least data_size bytes . returns zero if the push worked , nonzero otherwise
2010-07-07 18:51:48 +04:00
*
* max_slot can put a limit on how far into the leaf we'll push items. The
* item at 'max_slot' won't be touched. Use (u32)-1 to make us push all the
* items.
2009-03-13 17:04:31 +03:00
*/
static int push_leaf_left ( struct btrfs_trans_handle * trans , struct btrfs_root
2010-07-07 18:51:48 +04:00
* root , struct btrfs_path * path , int min_data_size ,
int data_size , int empty , u32 max_slot )
2009-03-13 17:04:31 +03:00
{
struct extent_buffer * right = path - > nodes [ 0 ] ;
struct extent_buffer * left ;
int slot ;
int free_space ;
u32 right_nritems ;
int ret = 0 ;
slot = path - > slots [ 1 ] ;
if ( slot = = 0 )
return 1 ;
if ( ! path - > nodes [ 1 ] )
return 1 ;
right_nritems = btrfs_header_nritems ( right ) ;
if ( right_nritems = = 0 )
return 1 ;
2021-09-22 12:36:45 +03:00
btrfs_assert_tree_write_locked ( path - > nodes [ 1 ] ) ;
2009-03-13 17:04:31 +03:00
2019-08-21 20:16:27 +03:00
left = btrfs_read_node_slot ( path - > nodes [ 1 ] , slot - 1 ) ;
2016-07-05 22:10:14 +03:00
if ( IS_ERR ( left ) )
2023-02-07 19:57:21 +03:00
return PTR_ERR ( left ) ;
2011-01-05 05:32:22 +03:00
2020-08-20 18:46:04 +03:00
__btrfs_tree_lock ( left , BTRFS_NESTING_LEFT ) ;
2009-03-13 17:04:31 +03:00
2019-03-20 16:36:46 +03:00
free_space = btrfs_leaf_free_space ( left ) ;
2009-03-13 17:04:31 +03:00
if ( free_space < data_size ) {
ret = 1 ;
goto out ;
}
ret = btrfs_cow_block ( trans , root , left ,
2020-08-20 18:46:03 +03:00
path - > nodes [ 1 ] , slot - 1 , & left ,
2020-08-20 18:46:05 +03:00
BTRFS_NESTING_LEFT_COW ) ;
2009-03-13 17:04:31 +03:00
if ( ret ) {
/* we hit -ENOSPC, but it isn't fatal here */
2012-03-12 19:03:00 +04:00
if ( ret = = - ENOSPC )
ret = 1 ;
2009-03-13 17:04:31 +03:00
goto out ;
}
btrfs: ctree: check key order before merging tree blocks
[BUG]
With a crafted image, btrfs can panic at btrfs_del_csums():
kernel BUG at fs/btrfs/ctree.c:3188!
invalid opcode: 0000 [#1] SMP PTI
CPU: 0 PID: 1156 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9
RIP: 0010:btrfs_set_item_key_safe+0x16c/0x180
RSP: 0018:ffff976141257ab8 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff898a6b890930 RCX: 0000000004b70000
RDX: 0000000000000000 RSI: ffff976141257bae RDI: ffff976141257acf
RBP: ffff976141257b10 R08: 0000000000001000 R09: ffff9761412579a8
R10: 0000000000000000 R11: 0000000000000000 R12: ffff976141257abe
R13: 0000000000000003 R14: ffff898a6a8be578 R15: ffff976141257bae
FS: 0000000000000000(0000) GS:ffff898a77a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f779d9cd624 CR3: 000000022b2b4006 CR4: 00000000000206f0
Call Trace:
truncate_one_csum+0xac/0xf0
btrfs_del_csums+0x24f/0x3a0
__btrfs_free_extent.isra.72+0x5a7/0xbe0
__btrfs_run_delayed_refs+0x539/0x1120
btrfs_run_delayed_refs+0xdb/0x1b0
btrfs_commit_transaction+0x52/0x950
? start_transaction+0x94/0x450
transaction_kthread+0x163/0x190
kthread+0x105/0x140
? btrfs_cleanup_transaction+0x560/0x560
? kthread_destroy_worker+0x50/0x50
ret_from_fork+0x35/0x40
Modules linked in:
---[ end trace 93bf9db00e6c374e ]---
[CAUSE]
This crafted image has a tricky key order corruption:
checksum tree key (CSUM_TREE ROOT_ITEM 0)
node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
...
key (EXTENT_CSUM EXTENT_CSUM 73785344) block 29757440 gen 19
key (EXTENT_CSUM EXTENT_CSUM 77594624) block 29753344 gen 19
...
leaf 29757440 items 5 free space 150 generation 19 owner CSUM_TREE
item 0 key (EXTENT_CSUM EXTENT_CSUM 73785344) itemoff 2323 itemsize 1672
range start 73785344 end 75497472 length 1712128
item 1 key (EXTENT_CSUM EXTENT_CSUM 75497472) itemoff 2319 itemsize 4
range start 75497472 end 75501568 length 4096
item 2 key (EXTENT_CSUM EXTENT_CSUM 75501568) itemoff 579 itemsize 1740
range start 75501568 end 77283328 length 1781760
item 3 key (EXTENT_CSUM EXTENT_CSUM 77283328) itemoff 575 itemsize 4
range start 77283328 end 77287424 length 4096
item 4 key (EXTENT_CSUM EXTENT_CSUM 4120596480) itemoff 275 itemsize 300 <<<
range start 4120596480 end 4120903680 length 307200
leaf 29753344 items 3 free space 1936 generation 19 owner CSUM_TREE
item 0 key (18446744073457893366 EXTENT_CSUM 77594624) itemoff 2323 itemsize 1672
range start 77594624 end 79306752 length 1712128
...
Note the item 4 key of leaf 29757440, which is obviously too large, and
even larger than the first key of the next leaf.
However, it still follows the key order within that tree block, so the
tree checker is unable to detect it at read time: the tree checker only
works inside one leaf, so this kind of cross-leaf corruption can't be
detected in advance.
[FIX]
The next opportunity to detect such a problem is at tree block merge time,
which is in push_node_left(), balance_node_right(), push_leaf_left() or
push_leaf_right().
Now we check whether the right-most key of the left node is larger than
the left-most key of the right node.
This way we don't need to call the full tree-checker, while still keeping
the key order correct: key order within each node is already checked by
the tree checker, so we only need to check the above two slots.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202833
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-19 09:35:50 +03:00
if ( check_sibling_keys ( left , right ) ) {
ret = - EUCLEAN ;
2023-04-26 13:51:35 +03:00
btrfs_abort_transaction ( trans , ret ) ;
2020-08-19 09:35:50 +03:00
goto out ;
}
2023-01-27 00:00:55 +03:00
return __push_leaf_left ( trans , path , min_data_size , empty , left ,
free_space , right_nritems , max_slot ) ;
2009-03-13 17:04:31 +03:00
out :
btrfs_tree_unlock ( left ) ;
free_extent_buffer ( left ) ;
return ret ;
}
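The sibling key check used before merging can be sketched in isolation. The version below is a simplified illustration under stated assumptions: `struct demo_key`, `demo_comp_keys` and `demo_check_sibling_keys` are hypothetical names, keys are compared by objectid, then type, then offset, and an empty block on either side is treated as nothing to check. It is not the kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a key in CPU byte order. */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Three-way compare: objectid first, then type, then offset. */
static int demo_comp_keys(const struct demo_key *a, const struct demo_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Return 0 when the last key of the left block sorts strictly before the
 * first key of the right block, nonzero when the order is broken.
 */
static int demo_check_sibling_keys(const struct demo_key *left, int left_nr,
				   const struct demo_key *right, int right_nr)
{
	assert(left_nr >= 0 && right_nr >= 0);
	/* Assumption: an empty block has no boundary key to compare. */
	if (left_nr == 0 || right_nr == 0)
		return 0;
	return demo_comp_keys(&left[left_nr - 1], &right[0]) >= 0;
}
```

Only the two boundary keys are compared, since the order inside each block is assumed to have been validated already when the block was read.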
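/*
* copy the items from slot 'mid' onward out of leaf 'l' into the new leaf
* 'right', fix up their data offsets, and insert a pointer to 'right' into
* the parent node. The path is updated to point at whichever leaf now holds
* the original slot.
*/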
2023-06-08 13:27:48 +03:00
static noinline int copy_for_split ( struct btrfs_trans_handle * trans ,
struct btrfs_path * path ,
struct extent_buffer * l ,
struct extent_buffer * right ,
int slot , int mid , int nritems )
2009-03-13 17:04:31 +03:00
{
2019-03-20 16:42:33 +03:00
struct btrfs_fs_info * fs_info = trans - > fs_info ;
2009-03-13 17:04:31 +03:00
int data_copy_size ;
int rt_data_off ;
int i ;
2023-06-08 13:27:48 +03:00
int ret ;
2009-03-13 17:04:31 +03:00
struct btrfs_disk_key disk_key ;
2012-03-03 16:40:03 +04:00
struct btrfs_map_token token ;
2009-03-13 17:04:31 +03:00
nritems = nritems - mid ;
btrfs_set_header_nritems ( right , nritems ) ;
2021-10-21 21:58:37 +03:00
data_copy_size = btrfs_item_data_end ( l , mid ) - leaf_data_end ( l ) ;
2009-03-13 17:04:31 +03:00
2022-11-15 19:16:17 +03:00
copy_leaf_items ( right , l , 0 , mid , nritems ) ;
2009-03-13 17:04:31 +03:00
2022-11-15 19:16:17 +03:00
copy_leaf_data ( right , l , BTRFS_LEAF_DATA_SIZE ( fs_info ) - data_copy_size ,
leaf_data_end ( l ) , data_copy_size ) ;
2009-03-13 17:04:31 +03:00
2021-10-21 21:58:37 +03:00
rt_data_off = BTRFS_LEAF_DATA_SIZE ( fs_info ) - btrfs_item_data_end ( l , mid ) ;
2009-03-13 17:04:31 +03:00
2019-08-09 18:48:21 +03:00
btrfs_init_map_token ( & token , right ) ;
2009-03-13 17:04:31 +03:00
for ( i = 0 ; i < nritems ; i + + ) {
u32 ioff ;
2021-10-21 21:58:35 +03:00
ioff = btrfs_token_item_offset ( & token , i ) ;
btrfs_set_token_item_offset ( & token , i , ioff + rt_data_off ) ;
2009-03-13 17:04:31 +03:00
}
btrfs_set_header_nritems ( l , mid ) ;
btrfs_item_key ( right , & disk_key , 0 ) ;
2023-06-08 13:27:48 +03:00
ret = insert_ptr ( trans , path , & disk_key , right - > start , path - > slots [ 1 ] + 1 , 1 ) ;
if ( ret < 0 )
return ret ;
2009-03-13 17:04:31 +03:00
btrfs_mark_buffer_dirty ( right ) ;
btrfs_mark_buffer_dirty ( l ) ;
BUG_ON ( path - > slots [ 0 ] ! = slot ) ;
if ( mid < = slot ) {
btrfs_tree_unlock ( path - > nodes [ 0 ] ) ;
free_extent_buffer ( path - > nodes [ 0 ] ) ;
path - > nodes [ 0 ] = right ;
path - > slots [ 0 ] - = mid ;
path - > slots [ 1 ] + = 1 ;
} else {
btrfs_tree_unlock ( right ) ;
free_extent_buffer ( right ) ;
}
BUG_ON ( path - > slots [ 0 ] < 0 ) ;
2023-06-08 13:27:48 +03:00
return 0 ;
2009-03-13 17:04:31 +03:00
}
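The offset fixup in the function above relies on one observation: the copied data keeps its internal layout and only moves up so that it ends at the end of the new leaf's data area. Every copied item's offset therefore shifts by the same amount, the leaf data size minus the data end of the first copied item. A small sketch of that arithmetic, with the hypothetical name `demo_rebase_offsets` and plain arrays in place of extent buffers:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: offsets[] holds the old data offsets of the items
 * that were copied into the new leaf. first_item_data_end is the old
 * (offset + size) of the first copied item, i.e. the top of the copied data.
 */
static void demo_rebase_offsets(uint32_t *offsets, int nritems,
				uint32_t first_item_data_end,
				uint32_t leaf_data_size)
{
	uint32_t shift;
	int i;

	assert(first_item_data_end <= leaf_data_size);
	shift = leaf_data_size - first_item_data_end;

	/* The whole copied region moves up by the same number of bytes. */
	for (i = 0; i < nritems; i++)
		offsets[i] += shift;
}
```

For example, with a 1000-byte data area and copied items that previously sat at offsets 850 (size 50) and 820 (size 30), the top of the copied data is 900, the shift is 100, and the new offsets are 950 and 920.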
2010-07-07 18:51:48 +04:00
/*
* double splits happen when we need to insert a big item in the middle
* of a leaf. A double split can leave us with 3 mostly empty leaves:
* leaf: [ slots 0 - N ] [ our target ] [ N + 1 - total in leaf ]
*            A                B                   C
*
* We avoid this by trying to push the items on either side of our target
* into the adjacent leaves. If all goes well we can avoid the double split
* completely.
*/
static noinline int push_for_double_split ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path ,
int data_size )
{
int ret ;
int progress = 0 ;
int slot ;
u32 nritems ;
2013-11-25 07:20:46 +04:00
int space_needed = data_size ;
2010-07-07 18:51:48 +04:00
slot = path - > slots [ 0 ] ;
2013-11-25 07:20:46 +04:00
if ( slot < btrfs_header_nritems ( path - > nodes [ 0 ] ) )
2019-03-20 16:36:46 +03:00
space_needed - = btrfs_leaf_free_space ( path - > nodes [ 0 ] ) ;
2010-07-07 18:51:48 +04:00
/*
* try to push all the items after our slot into the
* right leaf
*/
2013-11-25 07:20:46 +04:00
ret = push_leaf_right ( trans , root , path , 1 , space_needed , 0 , slot ) ;
2010-07-07 18:51:48 +04:00
if ( ret < 0 )
return ret ;
if ( ret = = 0 )
progress + + ;
nritems = btrfs_header_nritems ( path - > nodes [ 0 ] ) ;
/*
* our goal is to get our slot at the start or end of a leaf . If
* we ' ve done so we ' re done
*/
if ( path - > slots [ 0 ] = = 0 | | path - > slots [ 0 ] = = nritems )
return 0 ;
2019-03-20 16:36:46 +03:00
if ( btrfs_leaf_free_space ( path - > nodes [ 0 ] ) > = data_size )
2010-07-07 18:51:48 +04:00
return 0 ;
/* try to push all the items before our slot into the left leaf */
slot = path - > slots [ 0 ] ;
2017-02-17 21:43:57 +03:00
space_needed = data_size ;
if ( slot > 0 )
2019-03-20 16:36:46 +03:00
space_needed - = btrfs_leaf_free_space ( path - > nodes [ 0 ] ) ;
2013-11-25 07:20:46 +04:00
ret = push_leaf_left ( trans , root , path , 1 , space_needed , 0 , slot ) ;
2010-07-07 18:51:48 +04:00
if ( ret < 0 )
return ret ;
if ( ret = = 0 )
progress + + ;
if ( progress )
return 0 ;
return 1 ;
}
2007-02-02 19:05:29 +03:00
/*
* split the path ' s leaf in two , making sure there is at least data_size
* available for the resulting leaf level of the path .
2007-03-01 00:35:06 +03:00
*
* returns 0 if all went well and < 0 on failure .
2007-02-02 19:05:29 +03:00
*/
2008-09-06 00:13:11 +04:00
static noinline int split_leaf ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
2017-01-18 10:24:37 +03:00
const struct btrfs_key * ins_key ,
2008-09-06 00:13:11 +04:00
struct btrfs_path * path , int data_size ,
int extend )
2007-01-26 23:51:26 +03:00
{
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 18:45:14 +04:00
struct btrfs_disk_key disk_key ;
2007-10-16 00:14:19 +04:00
struct extent_buffer * l ;
2007-03-12 19:01:18 +03:00
u32 nritems ;
2007-02-02 17:18:22 +03:00
int mid ;
int slot ;
2007-10-16 00:14:19 +04:00
struct extent_buffer * right ;
2014-11-12 07:43:09 +03:00
struct btrfs_fs_info * fs_info = root - > fs_info ;
2007-04-04 22:08:15 +04:00
int ret = 0 ;
2007-03-01 00:35:06 +03:00
int wret ;
2009-06-10 18:45:14 +04:00
int split ;
2007-10-25 23:42:57 +04:00
int num_doubles = 0 ;
2010-07-07 18:51:48 +04:00
int tried_avoid_double = 0 ;
2007-03-01 00:35:06 +03:00
2009-09-24 17:17:31 +04:00
l = path - > nodes [ 0 ] ;
slot = path - > slots [ 0 ] ;
2021-10-21 21:58:35 +03:00
if ( extend & & data_size + btrfs_item_size ( l , slot ) +
2016-06-23 01:54:23 +03:00
sizeof ( struct btrfs_item ) > BTRFS_LEAF_DATA_SIZE ( fs_info ) )
2009-09-24 17:17:31 +04:00
return - EOVERFLOW ;
2007-03-17 21:29:23 +03:00
/* first try to make some room by pushing left and right */
2013-05-22 16:07:06 +04:00
if ( data_size & & path - > nodes [ 1 ] ) {
2013-11-25 07:20:46 +04:00
int space_needed = data_size ;
if ( slot < btrfs_header_nritems ( l ) )
2019-03-20 16:36:46 +03:00
space_needed - = btrfs_leaf_free_space ( l ) ;
2013-11-25 07:20:46 +04:00
wret = push_leaf_right ( trans , root , path , space_needed ,
space_needed , 0 , 0 ) ;
2009-01-06 05:25:51 +03:00
if ( wret < 0 )
2007-03-13 18:17:52 +03:00
return wret ;
2007-10-19 17:23:27 +04:00
if ( wret ) {
2017-02-17 21:43:57 +03:00
space_needed = data_size ;
if ( slot > 0 )
2019-03-20 16:36:46 +03:00
space_needed - = btrfs_leaf_free_space ( l ) ;
2013-11-25 07:20:46 +04:00
wret = push_leaf_left ( trans , root , path , space_needed ,
space_needed , 0 , ( u32 ) - 1 ) ;
2007-10-19 17:23:27 +04:00
if ( wret < 0 )
return wret ;
}
l = path - > nodes [ 0 ] ;
2007-03-01 00:35:06 +03:00
2007-10-19 17:23:27 +04:00
/* did the pushes work? */
2019-03-20 16:36:46 +03:00
if ( btrfs_leaf_free_space ( l ) > = data_size )
2007-10-19 17:23:27 +04:00
return 0 ;
2007-10-16 00:18:25 +04:00
}
2007-03-01 00:35:06 +03:00
2007-02-22 19:39:13 +03:00
if ( ! path - > nodes [ 1 ] ) {
2013-05-22 16:06:51 +04:00
ret = insert_new_root ( trans , root , path , 1 ) ;
2007-02-22 19:39:13 +03:00
if ( ret )
return ret ;
}
2007-10-25 23:42:57 +04:00
again :
2009-06-10 18:45:14 +04:00
split = 1 ;
2007-10-25 23:42:57 +04:00
l = path - > nodes [ 0 ] ;
2007-02-02 17:18:22 +03:00
slot = path - > slots [ 0 ] ;
2007-10-16 00:14:19 +04:00
nritems = btrfs_header_nritems ( l ) ;
2009-01-06 05:25:51 +03:00
mid = ( nritems + 1 ) / 2 ;
2007-06-22 22:16:25 +04:00
2009-06-10 18:45:14 +04:00
if ( mid < = slot ) {
if ( nritems = = 1 | |
leaf_space_used ( l , mid , nritems - mid ) + data_size >
2016-06-23 01:54:23 +03:00
BTRFS_LEAF_DATA_SIZE ( fs_info ) ) {
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
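The reference-count rule described in the commit message above can be illustrated with a small stand-alone model. This is only a sketch: `struct block`, `cow_block()` and `MAX_PTRS` are hypothetical names invented for the illustration, not btrfs structures or functions, and the model ignores back refs, transactions and on-disk state.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PTRS 4

/* Hypothetical in-memory model of a tree block; not a btrfs structure. */
struct block {
	int refs;                          /* reference count of this block */
	int freed;                         /* set once the block is released */
	size_t nr_children;                /* number of extents it points to */
	struct block *children[MAX_PTRS];
};

/*
 * Copy-on-write 'old' into the caller-provided 'copy'.
 * Returns how many child reference counts had to be modified.
 */
static size_t cow_block(struct block *old, struct block *copy)
{
	size_t touched = 0;
	size_t i;

	assert(old->refs >= 1 && old->nr_children <= MAX_PTRS);

	copy->refs = 1;
	copy->freed = 0;
	copy->nr_children = old->nr_children;
	for (i = 0; i < old->nr_children; i++)
		copy->children[i] = old->children[i];

	if (old->refs == 1) {
		/*
		 * Non-shared block: the copy inherits the old block's
		 * references, so no child count changes, and the old
		 * block is freed at once.
		 */
		old->refs = 0;
		old->freed = 1;
	} else {
		/*
		 * Shared block: every extent the new block points to
		 * gains one reference, and the old block loses one.
		 */
		for (i = 0; i < copy->nr_children; i++) {
			copy->children[i]->refs++;
			touched++;
		}
		old->refs--;
	}
	return touched;
}
```

In this model, COWing a block with a reference count of 1 touches no child counts at all, which is the saving the commit message attributes to avoiding dead root records.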
2009-06-10 18:45:14 +04:00
if ( slot > = nritems ) {
split = 0 ;
} else {
mid = slot ;
if ( mid ! = nritems & &
leaf_space_used ( l , mid , nritems - mid ) +
2016-06-23 01:54:23 +03:00
data_size > BTRFS_LEAF_DATA_SIZE ( fs_info ) ) {
2010-07-07 18:51:48 +04:00
if ( data_size & & ! tried_avoid_double )
goto push_for_double ;
2009-06-10 18:45:14 +04:00
split = 2 ;
}
}
}
} else {
if ( leaf_space_used ( l , 0 , mid ) + data_size >
2016-06-23 01:54:23 +03:00
BTRFS_LEAF_DATA_SIZE ( fs_info ) ) {
2009-06-10 18:45:14 +04:00
if ( ! extend & & data_size & & slot = = 0 ) {
split = 0 ;
} else if ( ( extend | | ! data_size ) & & slot = = 0 ) {
mid = 1 ;
} else {
mid = slot ;
if ( mid ! = nritems & &
leaf_space_used ( l , mid , nritems - mid ) +
2016-06-23 01:54:23 +03:00
data_size > BTRFS_LEAF_DATA_SIZE ( fs_info ) ) {
2010-07-07 18:51:48 +04:00
if ( data_size & & ! tried_avoid_double )
goto push_for_double ;
2013-10-31 09:03:04 +04:00
split = 2 ;
2009-06-10 18:45:14 +04:00
}
}
}
}
if ( split = = 0 )
btrfs_cpu_key_to_disk ( & disk_key , ins_key ) ;
else
btrfs_item_key ( l , & disk_key , mid ) ;
2020-08-20 18:46:08 +03:00
/*
* We have to use BTRFS_NESTING_NEW_ROOT here if we ' ve done a double
* split , because we ' re only allowed to have MAX_LOCKDEP_SUBCLASSES
* subclasses , which is 8 at the time of this patch , and we ' ve maxed it
* out . In the future we could add a
* BTRFS_NESTING_SPLIT_THE_SPLITTENING if we need to , but for now just
* use BTRFS_NESTING_NEW_ROOT .
*/
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by and large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
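The central idea of the commit message above, reserving system space and updating the chunk btree inside one critical section, can be sketched in user space. Everything here is an assumption made for illustration: `struct chunk_state`, its fields and `alloc_chunk_phase1()` are hypothetical, a pthread mutex stands in for fs_info->chunk_mutex, and a counter stands in for the chunk btree.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state guarded by the chunk mutex. */
struct chunk_state {
	pthread_mutex_t chunk_mutex;
	unsigned long sys_free;      /* free system space, arbitrary units */
	unsigned long sys_reserved;  /* space reserved but not yet consumed */
	unsigned long chunk_items;   /* items in the modelled chunk tree */
};

/*
 * Check for space, reserve it, and apply the chunk tree update in the
 * same critical section, so that no reservation is left pending for a
 * later phase.
 */
static bool alloc_chunk_phase1(struct chunk_state *st, unsigned long need)
{
	unsigned long reserved_before;
	bool ok = false;

	pthread_mutex_lock(&st->chunk_mutex);
	reserved_before = st->sys_reserved;
	if (st->sys_free >= need) {
		st->sys_free -= need;
		st->sys_reserved += need;  /* reserve system space */
		st->chunk_items++;         /* insert the new chunk item */
		st->sys_reserved -= need;  /* reservation consumed here */
		ok = true;
	}
	/* Nothing stays reserved once the mutex is dropped. */
	assert(st->sys_reserved == reserved_before);
	pthread_mutex_unlock(&st->chunk_mutex);
	return ok;
}
```

Because the reservation is consumed before the mutex is released, concurrent callers in this model can never pile up reservations while waiting for a later phase, which is the exhaustion scenario the commit message describes.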
2021-06-29 16:43:06 +03:00
right = btrfs_alloc_tree_block ( trans , root , 0 , root - > root_key . objectid ,
& disk_key , 0 , l - > start , 0 ,
num_doubles ? BTRFS_NESTING_NEW_ROOT :
BTRFS_NESTING_SPLIT ) ;
2010-05-16 18:46:25 +04:00
if ( IS_ERR ( right ) )
2007-10-16 00:14:19 +04:00
return PTR_ERR ( right ) ;
2010-05-16 18:46:25 +04:00
2023-09-08 02:09:33 +03:00
root_add_used_bytes ( root ) ;
2007-10-16 00:14:19 +04:00
2009-06-10 18:45:14 +04:00
if ( split = = 0 ) {
if ( mid < = slot ) {
btrfs_set_header_nritems ( right , 0 ) ;
2023-06-08 13:27:48 +03:00
ret = insert_ptr ( trans , path , & disk_key ,
right - > start , path - > slots [ 1 ] + 1 , 1 ) ;
if ( ret < 0 ) {
btrfs_tree_unlock ( right ) ;
free_extent_buffer ( right ) ;
return ret ;
}
2009-06-10 18:45:14 +04:00
btrfs_tree_unlock ( path - > nodes [ 0 ] ) ;
free_extent_buffer ( path - > nodes [ 0 ] ) ;
path - > nodes [ 0 ] = right ;
path - > slots [ 0 ] = 0 ;
path - > slots [ 1 ] + = 1 ;
} else {
btrfs_set_header_nritems ( right , 0 ) ;
2023-06-08 13:27:48 +03:00
ret = insert_ptr ( trans , path , & disk_key ,
right - > start , path - > slots [ 1 ] , 1 ) ;
if ( ret < 0 ) {
btrfs_tree_unlock ( right ) ;
free_extent_buffer ( right ) ;
return ret ;
}
2009-06-10 18:45:14 +04:00
btrfs_tree_unlock ( path - > nodes [ 0 ] ) ;
free_extent_buffer ( path - > nodes [ 0 ] ) ;
path - > nodes [ 0 ] = right ;
path - > slots [ 0 ] = 0 ;
2012-03-01 17:56:26 +04:00
if ( path - > slots [ 1 ] = = 0 )
2018-06-20 15:48:47 +03:00
fixup_low_keys ( path , & disk_key , 1 ) ;
2007-04-04 22:08:15 +04:00
}
2016-09-08 00:48:28 +03:00
/*
* We create a new leaf ' right ' for the required ins_len and
* we ' ll do btrfs_mark_buffer_dirty ( ) on this leaf after copying
* the content of ins_len to ' right ' .
*/
2009-06-10 18:45:14 +04:00
return ret ;
2007-04-04 22:08:15 +04:00
}
2007-02-02 19:05:29 +03:00
2023-06-08 13:27:48 +03:00
ret = copy_for_split ( trans , path , l , right , slot , mid , nritems ) ;
if ( ret < 0 ) {
btrfs_tree_unlock ( right ) ;
free_extent_buffer ( right ) ;
return ret ;
}
2008-09-23 21:14:14 +04:00
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
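The reference counting rule described in the commit message above (free a non-shared block at once, otherwise bump the children and drop one reference on the old block) can be sketched with a toy model. Everything here is hypothetical: `struct toy_block`, `toy_cow()` and `TOY_MAX_PTRS` are illustrative names, not btrfs structures or functions, and the sketch ignores back refs, locking and on-disk state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PTRS 4

/* Hypothetical toy model of a tree block; not a btrfs structure. */
struct toy_block {
	int refs;                               /* references held on this block */
	size_t nr;                              /* number of child pointers in use */
	struct toy_block *child[TOY_MAX_PTRS];  /* extents this block points to */
	bool freed;                             /* set once the block is released */
};

/*
 * COW 'old' into the caller-provided 'new_blk', following the rule from the
 * commit message: a non-shared block is freed at once and its references are
 * inherited; a shared block keeps living with one reference less, and every
 * extent the new block points to gains one reference.
 */
static void toy_cow(struct toy_block *old, struct toy_block *new_blk)
{
	size_t i;

	assert(old->refs >= 1 && old->nr <= TOY_MAX_PTRS);

	new_blk->refs = 1;
	new_blk->nr = old->nr;
	new_blk->freed = false;
	for (i = 0; i < old->nr; i++)
		new_blk->child[i] = old->child[i];

	if (old->refs == 1) {
		/* Not shared: child reference counts stay untouched. */
		old->refs = 0;
		old->freed = true;
	} else {
		/* Shared: both blocks now reference every child. */
		for (i = 0; i < new_blk->nr; i++)
			new_blk->child[i]->refs++;
		old->refs--;
	}
}
```

In the non-shared case no child counter changes at all, which is the saving the commit message refers to when it says the increments and later decrements used to cancel out.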
2009-06-10 18:45:14 +04:00
if ( split = = 2 ) {
2007-10-25 23:42:57 +04:00
BUG_ON ( num_doubles ! = 0 ) ;
num_doubles + + ;
goto again ;
2007-04-19 00:15:28 +04:00
}
2009-03-13 17:04:31 +03:00
2012-03-01 17:56:26 +04:00
return 0 ;
2010-07-07 18:51:48 +04:00
push_for_double :
push_for_double_split ( trans , root , path , data_size ) ;
tried_avoid_double = 1 ;
2019-03-20 16:36:46 +03:00
if ( btrfs_leaf_free_space ( path - > nodes [ 0 ] ) > = data_size )
2010-07-07 18:51:48 +04:00
return 0 ;
goto again ;
2007-01-26 23:51:26 +03:00
}
2009-11-12 12:33:58 +03:00
static noinline int setup_leaf_for_split ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path , int ins_len )
2008-12-10 17:10:46 +03:00
{
2009-11-12 12:33:58 +03:00
struct btrfs_key key ;
2008-12-10 17:10:46 +03:00
struct extent_buffer * leaf ;
2009-11-12 12:33:58 +03:00
struct btrfs_file_extent_item * fi ;
u64 extent_len = 0 ;
u32 item_size ;
int ret ;
2008-12-10 17:10:46 +03:00
leaf = path - > nodes [ 0 ] ;
2009-11-12 12:33:58 +03:00
btrfs_item_key_to_cpu ( leaf , & key , path - > slots [ 0 ] ) ;
BUG_ON ( key . type ! = BTRFS_EXTENT_DATA_KEY & &
key . type ! = BTRFS_EXTENT_CSUM_KEY ) ;
2019-03-20 16:36:46 +03:00
if ( btrfs_leaf_free_space ( leaf ) > = ins_len )
2009-11-12 12:33:58 +03:00
return 0 ;
2008-12-10 17:10:46 +03:00
2021-10-21 21:58:35 +03:00
item_size = btrfs_item_size ( leaf , path - > slots [ 0 ] ) ;
2009-11-12 12:33:58 +03:00
if ( key . type = = BTRFS_EXTENT_DATA_KEY ) {
fi = btrfs_item_ptr ( leaf , path - > slots [ 0 ] ,
struct btrfs_file_extent_item ) ;
extent_len = btrfs_file_extent_num_bytes ( leaf , fi ) ;
}
2011-04-21 03:20:15 +04:00
btrfs_release_path ( path ) ;
2008-12-10 17:10:46 +03:00
path - > keep_locks = 1 ;
2009-11-12 12:33:58 +03:00
path - > search_for_split = 1 ;
ret = btrfs_search_slot ( trans , root , & key , path , 0 , 1 ) ;
2008-12-10 17:10:46 +03:00
path - > search_for_split = 0 ;
2015-01-20 15:40:53 +03:00
if ( ret > 0 )
ret = - EAGAIN ;
2009-11-12 12:33:58 +03:00
if ( ret < 0 )
goto err ;
2008-12-10 17:10:46 +03:00
2009-11-12 12:33:58 +03:00
ret = - EAGAIN ;
leaf = path - > nodes [ 0 ] ;
2015-01-20 15:40:53 +03:00
/* if our item isn't there, return now */
2021-10-21 21:58:35 +03:00
if ( item_size ! = btrfs_item_size ( leaf , path - > slots [ 0 ] ) )
2009-11-12 12:33:58 +03:00
goto err ;
2010-04-02 17:20:18 +04:00
/* the leaf has changed, it now has room. return now */
2019-03-20 16:36:46 +03:00
if ( btrfs_leaf_free_space ( path - > nodes [ 0 ] ) > = ins_len )
2010-04-02 17:20:18 +04:00
goto err ;
2009-11-12 12:33:58 +03:00
if ( key . type = = BTRFS_EXTENT_DATA_KEY ) {
fi = btrfs_item_ptr ( leaf , path - > slots [ 0 ] ,
struct btrfs_file_extent_item ) ;
if ( extent_len ! = btrfs_file_extent_num_bytes ( leaf , fi ) )
goto err ;
2008-12-10 17:10:46 +03:00
}
2009-11-12 12:33:58 +03:00
ret = split_leaf ( trans , root , & key , path , ins_len , 1 ) ;
2010-05-16 18:46:25 +04:00
if ( ret )
goto err ;
2008-12-10 17:10:46 +03:00
2009-11-12 12:33:58 +03:00
path - > keep_locks = 0 ;
2009-03-13 18:00:37 +03:00
btrfs_unlock_up_safe ( path , 1 ) ;
2009-11-12 12:33:58 +03:00
return 0 ;
err :
path - > keep_locks = 0 ;
return ret ;
}
2019-03-20 16:44:57 +03:00
static noinline int split_item ( struct btrfs_path * path ,
2017-01-18 10:24:37 +03:00
const struct btrfs_key * new_key ,
2009-11-12 12:33:58 +03:00
unsigned long split_offset )
{
struct extent_buffer * leaf ;
2021-10-21 21:58:32 +03:00
int orig_slot , slot ;
2009-11-12 12:33:58 +03:00
char * buf ;
u32 nritems ;
u32 item_size ;
u32 orig_offset ;
struct btrfs_disk_key disk_key ;
2009-03-13 18:00:37 +03:00
leaf = path - > nodes [ 0 ] ;
2023-06-09 13:49:07 +03:00
/*
 * Shouldn't happen because the caller must have previously called
 * setup_leaf_for_split() to make room for the new item in the leaf.
 */
if ( WARN_ON ( btrfs_leaf_free_space ( leaf ) < sizeof ( struct btrfs_item ) ) )
return - ENOSPC ;
2009-03-13 18:00:37 +03:00
2021-10-21 21:58:32 +03:00
orig_slot = path - > slots [ 0 ] ;
2021-10-21 21:58:35 +03:00
orig_offset = btrfs_item_offset ( leaf , path - > slots [ 0 ] ) ;
item_size = btrfs_item_size ( leaf , path - > slots [ 0 ] ) ;
2008-12-10 17:10:46 +03:00
buf = kmalloc ( item_size , GFP_NOFS ) ;
2009-11-12 12:33:58 +03:00
if ( ! buf )
return - ENOMEM ;
2008-12-10 17:10:46 +03:00
read_extent_buffer ( leaf , buf , btrfs_item_ptr_offset ( leaf ,
path - > slots [ 0 ] ) , item_size ) ;
2009-11-12 12:33:58 +03:00
slot = path - > slots [ 0 ] + 1 ;
2008-12-10 17:10:46 +03:00
nritems = btrfs_header_nritems ( leaf ) ;
if ( slot ! = nritems ) {
/* shift the items */
2022-11-15 19:16:17 +03:00
memmove_leaf_items ( leaf , slot + 1 , slot , nritems - slot ) ;
2008-12-10 17:10:46 +03:00
}
btrfs_cpu_key_to_disk ( & disk_key , new_key ) ;
btrfs_set_item_key ( leaf , & disk_key , slot ) ;
2021-10-21 21:58:35 +03:00
btrfs_set_item_offset ( leaf , slot , orig_offset ) ;
btrfs_set_item_size ( leaf , slot , item_size - split_offset ) ;
2008-12-10 17:10:46 +03:00
2021-10-21 21:58:35 +03:00
btrfs_set_item_offset ( leaf , orig_slot ,
2021-10-21 21:58:32 +03:00
orig_offset + item_size - split_offset ) ;
2021-10-21 21:58:35 +03:00
btrfs_set_item_size ( leaf , orig_slot , split_offset ) ;
2008-12-10 17:10:46 +03:00
btrfs_set_header_nritems ( leaf , nritems + 1 ) ;
/* write the data for the start of the original item */
write_extent_buffer ( leaf , buf ,
btrfs_item_ptr_offset ( leaf , path - > slots [ 0 ] ) ,
split_offset ) ;
/* write the data for the new item */
write_extent_buffer ( leaf , buf + split_offset ,
btrfs_item_ptr_offset ( leaf , slot ) ,
item_size - split_offset ) ;
btrfs_mark_buffer_dirty ( leaf ) ;
2019-03-20 16:36:46 +03:00
BUG_ON ( btrfs_leaf_free_space ( leaf ) < 0 ) ;
2008-12-10 17:10:46 +03:00
kfree ( buf ) ;
2009-11-12 12:33:58 +03:00
return 0 ;
}
/*
 * This function splits a single item into two items,
 * giving 'new_key' to the new item and splitting the
 * old one at split_offset (from the start of the item).
 *
 * The path may be released by this operation. After
 * the split, the path is pointing to the old item. The
 * new item is going to be in the same node as the old one.
 *
 * Note, the item being split must be small enough to live alone on
 * a tree block with room for one extra struct btrfs_item.
 *
 * This allows us to split the item in place, keeping a lock on the
 * leaf the entire time.
 */
int btrfs_split_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path ,
2017-01-18 10:24:37 +03:00
const struct btrfs_key * new_key ,
2009-11-12 12:33:58 +03:00
unsigned long split_offset )
{
int ret ;
ret = setup_leaf_for_split ( trans , root , path ,
sizeof ( struct btrfs_item ) ) ;
if ( ret )
return ret ;
2019-03-20 16:44:57 +03:00
ret = split_item ( path , new_key , split_offset ) ;
2008-12-10 17:10:46 +03:00
return ret ;
}
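The offset and size bookkeeping that split_item() performs can be followed in isolation. The sketch below is a simplified model, assuming an item is described only by a data offset and a size; `struct toy_item` and `toy_split()` are hypothetical names, and keys, the item array shift and the data copy are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the offset/size pair of a leaf item. */
struct toy_item {
	uint32_t offset;  /* where the item's data starts inside the leaf */
	uint32_t size;    /* length of the item's data in bytes */
};

/*
 * Split 'orig' at 'split_offset' bytes from the start of its data.
 * The new item takes over the low end of the original data range and
 * holds the tail (item_size - split_offset bytes); the original item
 * moves to the high end of the range and keeps the first split_offset
 * bytes. Together they cover exactly the range the original covered.
 */
static void toy_split(struct toy_item *orig, struct toy_item *new_item,
		      uint32_t split_offset)
{
	uint32_t orig_offset = orig->offset;
	uint32_t item_size = orig->size;

	assert(split_offset <= item_size);

	new_item->offset = orig_offset;
	new_item->size = item_size - split_offset;

	orig->offset = orig_offset + item_size - split_offset;
	orig->size = split_offset;
}
```

For an item at offset 100 with size 40 split at 10, the new item ends up at offset 100 with size 30 and the original at offset 130 with size 10, so no leaf data space is gained or lost; only one extra item header is needed.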
2008-09-29 23:18:18 +04:00
/*
 * Make the item pointed to by the path smaller. new_size indicates
 * how small to make it, and from_end tells us if we just chop bytes
 * off the end of the item or if we shift the item to chop bytes off
 * the front.
 */
2019-03-20 16:49:12 +03:00
void btrfs_truncate_item ( struct btrfs_path * path , u32 new_size , int from_end )
2007-04-17 21:26:50 +04:00
{
int slot ;
2007-10-16 00:14:19 +04:00
struct extent_buffer * leaf ;
2007-04-17 21:26:50 +04:00
u32 nritems ;
unsigned int data_end ;
unsigned int old_data_start ;
unsigned int old_size ;
unsigned int size_diff ;
int i ;
2012-03-03 16:40:03 +04:00
struct btrfs_map_token token ;
2007-10-16 00:14:19 +04:00
leaf = path - > nodes [ 0 ] ;
2007-11-01 18:28:41 +03:00
slot = path - > slots [ 0 ] ;
2021-10-21 21:58:35 +03:00
old_size = btrfs_item_size ( leaf , slot ) ;
2007-11-01 18:28:41 +03:00
if ( old_size = = new_size )
2012-03-01 17:56:26 +04:00
return ;
2007-04-17 21:26:50 +04:00
2007-10-16 00:14:19 +04:00
nritems = btrfs_header_nritems ( leaf ) ;
2019-03-20 13:33:10 +03:00
data_end = leaf_data_end ( leaf ) ;
2007-04-17 21:26:50 +04:00
2021-10-21 21:58:35 +03:00
old_data_start = btrfs_item_offset ( leaf , slot ) ;
2007-11-01 18:28:41 +03:00
2007-04-17 21:26:50 +04:00
size_diff = old_size - new_size ;
BUG_ON ( slot < 0 ) ;
BUG_ON ( slot > = nritems ) ;
/*
* item0..itemN ... dataN.offset..dataN.size .. data0.size
*/
/* first correct the data pointers */
2019-08-09 18:48:21 +03:00
btrfs_init_map_token ( & token , leaf ) ;
2007-04-17 21:26:50 +04:00
for ( i = slot ; i < nritems ; i + + ) {
2007-10-16 00:14:19 +04:00
u32 ioff ;
2007-10-16 00:15:53 +04:00
2021-10-21 21:58:35 +03:00
ioff = btrfs_token_item_offset ( & token , i ) ;
btrfs_set_token_item_offset ( & token , i , ioff + size_diff ) ;
2007-04-17 21:26:50 +04:00
}
2007-10-16 00:15:53 +04:00
2007-04-17 21:26:50 +04:00
/* shift the data */
2007-11-01 18:28:41 +03:00
if ( from_end ) {
2022-11-15 19:16:17 +03:00
memmove_leaf_data ( leaf , data_end + size_diff , data_end ,
old_data_start + new_size - data_end ) ;
2007-11-01 18:28:41 +03:00
} else {
struct btrfs_disk_key disk_key ;
u64 offset ;
btrfs_item_key ( leaf , & disk_key , slot ) ;
if ( btrfs_disk_key_type ( & disk_key ) = = BTRFS_EXTENT_DATA_KEY ) {
unsigned long ptr ;
struct btrfs_file_extent_item * fi ;
fi = btrfs_item_ptr ( leaf , slot ,
struct btrfs_file_extent_item ) ;
fi = ( struct btrfs_file_extent_item * ) (
( unsigned long ) fi - size_diff ) ;
if ( btrfs_file_extent_type ( leaf , fi ) = =
BTRFS_FILE_EXTENT_INLINE ) {
ptr = btrfs_item_ptr_offset ( leaf , slot ) ;
memmove_extent_buffer ( leaf , ptr ,
2009-01-06 05:25:51 +03:00
( unsigned long ) fi ,
2014-07-24 19:34:58 +04:00
BTRFS_FILE_EXTENT_INLINE_DATA_START ) ;
2007-11-01 18:28:41 +03:00
}
}
2022-11-15 19:16:17 +03:00
memmove_leaf_data ( leaf , data_end + size_diff , data_end ,
old_data_start - data_end ) ;
2007-11-01 18:28:41 +03:00
offset = btrfs_disk_key_offset ( & disk_key ) ;
btrfs_set_disk_key_offset ( & disk_key , offset + size_diff ) ;
btrfs_set_item_key ( leaf , & disk_key , slot ) ;
if ( slot = = 0 )
2018-06-20 15:48:47 +03:00
fixup_low_keys ( path , & disk_key , 1 ) ;
2007-11-01 18:28:41 +03:00
}
2007-10-16 00:14:19 +04:00
2021-10-21 21:58:35 +03:00
btrfs_set_item_size ( leaf , slot , new_size ) ;
2007-10-16 00:14:19 +04:00
btrfs_mark_buffer_dirty ( leaf ) ;
2007-04-17 21:26:50 +04:00
2019-03-20 16:36:46 +03:00
if ( btrfs_leaf_free_space ( leaf ) < 0 ) {
2017-06-29 19:37:49 +03:00
btrfs_print_leaf ( leaf ) ;
2007-04-17 21:26:50 +04:00
BUG ( ) ;
2007-10-16 00:14:19 +04:00
}
2007-04-17 21:26:50 +04:00
}
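The from_end branch of btrfs_truncate_item() can be traced on a toy leaf. This is a sketch under simplifying assumptions: the leaf is a flat 64-byte data area with separate offset and size arrays, item data grows downward from the end of the area, and `struct toy_leaf`, `toy_data_end()` and `toy_truncate_from_end()` are hypothetical names rather than btrfs helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_LEAF_SIZE 64
#define TOY_MAX_ITEMS 8

/* Hypothetical toy leaf: item i owns data[offset[i] .. offset[i] + size[i]). */
struct toy_leaf {
	uint32_t nritems;
	uint32_t offset[TOY_MAX_ITEMS];
	uint32_t size[TOY_MAX_ITEMS];
	unsigned char data[TOY_LEAF_SIZE];
};

/* Lowest used data offset; the last item sits at the lowest offset. */
static uint32_t toy_data_end(const struct toy_leaf *leaf)
{
	return leaf->nritems ? leaf->offset[leaf->nritems - 1] : TOY_LEAF_SIZE;
}

/* Chop bytes off the end of the item at 'slot', mirroring the from_end case. */
static void toy_truncate_from_end(struct toy_leaf *leaf, uint32_t slot,
				  uint32_t new_size)
{
	uint32_t data_end = toy_data_end(leaf);
	uint32_t old_data_start, old_size, size_diff, i;

	assert(slot < leaf->nritems);
	old_data_start = leaf->offset[slot];
	old_size = leaf->size[slot];
	assert(new_size <= old_size);
	size_diff = old_size - new_size;

	/* First correct the data pointers of this item and all later ones. */
	for (i = slot; i < leaf->nritems; i++)
		leaf->offset[i] += size_diff;

	/* Then shift the data up, dropping the tail of the truncated item. */
	memmove(leaf->data + data_end + size_diff, leaf->data + data_end,
		old_data_start + new_size - data_end);

	leaf->size[slot] = new_size;
}
```

With two items at offsets 56 (8 bytes) and 50 (6 bytes), truncating the first to 5 bytes moves both offsets up by 3 and frees 3 bytes at the low end of the data area.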
2008-09-29 23:18:18 +04:00
/*
2013-05-07 14:23:30 +04:00
* Make the item pointed to by the path bigger, data_size is the added size.
2008-09-29 23:18:18 +04:00
*/
2019-03-20 16:51:10 +03:00
void btrfs_extend_item ( struct btrfs_path * path , u32 data_size )
2007-04-16 17:22:45 +04:00
{
int slot ;
2007-10-16 00:14:19 +04:00
struct extent_buffer * leaf ;
2007-04-16 17:22:45 +04:00
u32 nritems ;
unsigned int data_end ;
unsigned int old_data ;
unsigned int old_size ;
int i ;
2012-03-03 16:40:03 +04:00
struct btrfs_map_token token ;
2007-10-16 00:14:19 +04:00
leaf = path - > nodes [ 0 ] ;
2007-04-16 17:22:45 +04:00
2007-10-16 00:14:19 +04:00
nritems = btrfs_header_nritems ( leaf ) ;
2019-03-20 13:33:10 +03:00
data_end = leaf_data_end ( leaf ) ;
2007-04-16 17:22:45 +04:00
2019-03-20 16:36:46 +03:00
if ( btrfs_leaf_free_space ( leaf ) < data_size ) {
2017-06-29 19:37:49 +03:00
btrfs_print_leaf ( leaf ) ;
2007-04-16 17:22:45 +04:00
BUG ( ) ;
2007-10-16 00:14:19 +04:00
}
2007-04-16 17:22:45 +04:00
slot = path - > slots [ 0 ] ;
2021-10-21 21:58:37 +03:00
old_data = btrfs_item_data_end ( leaf , slot ) ;
2007-04-16 17:22:45 +04:00
BUG_ON ( slot < 0 ) ;
2007-10-16 00:18:25 +04:00
if ( slot > = nritems ) {
2017-06-29 19:37:49 +03:00
btrfs_print_leaf ( leaf ) ;
2019-03-20 16:51:10 +03:00
btrfs_crit(leaf->fs_info, "slot %d too large, nritems %d",
2016-06-23 01:54:23 +03:00
slot , nritems ) ;
btrfs: use BUG() instead of BUG_ON(1)
BUG_ON(1) leads to bogus warnings from clang when
CONFIG_PROFILE_ANNOTATED_BRANCHES is set:
fs/btrfs/volumes.c:5041:3: error: variable 'max_chunk_size' is used uninitialized whenever 'if' condition is false
[-Werror,-Wsometimes-uninitialized]
BUG_ON(1);
^~~~~~~~~
include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON'
#define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
^~~~~~~~~~~~~~~~~~~
include/linux/compiler.h:48:23: note: expanded from macro 'unlikely'
# define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x)))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/btrfs/volumes.c:5046:9: note: uninitialized use occurs here
max_chunk_size);
^~~~~~~~~~~~~~
include/linux/kernel.h:860:36: note: expanded from macro 'min'
#define min(x, y) __careful_cmp(x, y, <)
^
include/linux/kernel.h:853:17: note: expanded from macro '__careful_cmp'
__cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op))
^
include/linux/kernel.h:847:25: note: expanded from macro '__cmp_once'
typeof(y) unique_y = (y); \
^
fs/btrfs/volumes.c:5041:3: note: remove the 'if' if its condition is always true
BUG_ON(1);
^
include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON'
#define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
^
fs/btrfs/volumes.c:4993:20: note: initialize the variable 'max_chunk_size' to silence this warning
u64 max_chunk_size;
^
= 0
Change it to BUG() so clang can see that this code path can never
continue.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-03-25 16:02:25 +03:00
BUG ( ) ;
2007-10-16 00:18:25 +04:00
}
2007-04-16 17:22:45 +04:00
/*
* item0..itemN ... dataN.offset..dataN.size .. data0.size
*/
/* first correct the data pointers */
2019-08-09 18:48:21 +03:00
btrfs_init_map_token ( & token , leaf ) ;
2007-04-16 17:22:45 +04:00
for ( i = slot ; i < nritems ; i + + ) {
2007-10-16 00:14:19 +04:00
u32 ioff ;
2007-10-16 00:15:53 +04:00
2021-10-21 21:58:35 +03:00
ioff = btrfs_token_item_offset ( & token , i ) ;
btrfs_set_token_item_offset ( & token , i , ioff - data_size ) ;
2007-04-16 17:22:45 +04:00
}
2007-10-16 00:14:19 +04:00
2007-04-16 17:22:45 +04:00
/* shift the data */
2022-11-15 19:16:17 +03:00
memmove_leaf_data ( leaf , data_end - data_size , data_end ,
old_data - data_end ) ;
2007-10-16 00:14:19 +04:00
2007-04-16 17:22:45 +04:00
data_end = old_data ;
2021-10-21 21:58:35 +03:00
old_size = btrfs_item_size ( leaf , slot ) ;
btrfs_set_item_size ( leaf , slot , old_size + data_size ) ;
2007-10-16 00:14:19 +04:00
btrfs_mark_buffer_dirty ( leaf ) ;
2007-04-16 17:22:45 +04:00
2019-03-20 16:36:46 +03:00
if ( btrfs_leaf_free_space ( leaf ) < 0 ) {
2017-06-29 19:37:49 +03:00
btrfs_print_leaf ( leaf ) ;
2007-04-16 17:22:45 +04:00
BUG ( ) ;
2007-10-16 00:14:19 +04:00
}
2007-04-16 17:22:45 +04:00
}
2022-10-27 15:21:42 +03:00
/*
* Make space in the node before inserting one or more items.
2020-09-01 17:40:00 +03:00
*
* @root: root we are inserting items to
* @path: points to the leaf/slot where we are going to insert new items
btrfs: loop only once over data sizes array when inserting an item batch
When inserting a batch of items into a btree, we end up looping over the
data sizes array 3 times:
1) Once in the caller of btrfs_insert_empty_items(), when it populates the
array with the data sizes for each item;
2) Once at btrfs_insert_empty_items() to sum the elements of the data
sizes array and compute the total data size;
3) And then once again at setup_items_for_insert(), where we do exactly
the same as what we do at btrfs_insert_empty_items(), to compute the
total data size.
That is not bad for small arrays, but when the arrays have hundreds of
elements, the time spent on looping is not negligible. For example when
doing batch inserts of delayed items for dir index items or when logging
a directory, it's common to have 200 to 260 dir index items in a single
batch when using a leaf size of 16K and using file names between 8 and 12
characters. For a 64K leaf size, multiply that by 4. Taking into account
that during directory logging or when flushing delayed dir index items we
can have many of those large batches, the time spent on the looping adds
up quickly.
It's also more important to avoid it at setup_items_for_insert(), since
we are holding a write lock on a leaf and, in some cases, on upper nodes
of the btree, which causes us to block other tasks that want to access
the leaf and nodes for longer than necessary.
So change the code so that setup_items_for_insert() and
btrfs_insert_empty_items() no longer compute the total data size, and
instead rely on the caller to supply it. This makes us loop over the
array only once, where we can both populate the data size array and
compute the total data size, taking advantage of spatial and temporal
locality. To make this more manageable, use a structure to contain
all the relevant details for a batch of items (keys array, data sizes
array, total data size, number of items), and use it as an argument
for btrfs_insert_empty_items() and setup_items_for_insert().
This patch is part of a small patchset that is comprised of the following
patches:
btrfs: loop only once over data sizes array when inserting an item batch
btrfs: unexport setup_items_for_insert()
btrfs: use single bulk copy operations when logging directories
This is patch 1/3 and performance results, and the specific tests, are
included in the changelog of patch 3/3.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
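The single-pass idea from the commit message above can be sketched as follows: the caller fills the data sizes array and accumulates the total in the same loop, and the consumer only adds the per-item header cost. `struct toy_item_batch`, `toy_fill_batch()` and `toy_total_size()` are hypothetical names; the struct loosely mirrors the fields listed in the commit message (the keys array is omitted), and the header lengths are passed in as parameters rather than taken from real btrfs structures.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical batch descriptor, loosely following the commit message. */
struct toy_item_batch {
	const uint32_t *data_sizes;  /* one data size per item */
	uint32_t total_data_size;    /* sum of data_sizes[0 .. nr - 1] */
	int nr;                      /* number of items in the batch */
};

/*
 * Populate the sizes array and compute the total in one pass, so that
 * nobody further down the call chain has to loop over the array again.
 */
static void toy_fill_batch(struct toy_item_batch *batch, uint32_t *sizes,
			   const uint32_t *name_lens, int nr,
			   uint32_t header_len)
{
	int i;

	assert(nr >= 0);
	batch->data_sizes = sizes;
	batch->total_data_size = 0;
	batch->nr = nr;
	for (i = 0; i < nr; i++) {
		sizes[i] = header_len + name_lens[i];
		batch->total_data_size += sizes[i];
	}
}

/* Leaf space needed: the item data plus one item header per item. */
static uint32_t toy_total_size(const struct toy_item_batch *batch,
			       uint32_t item_header_len)
{
	return batch->total_data_size + (uint32_t)batch->nr * item_header_len;
}
```

The consumer side corresponds to the `total_size` computation in setup_items_for_insert(), which reads `batch->total_data_size` instead of summing the array while the leaf is write locked.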
2021-09-24 14:28:13 +03:00
* @batch: information about the batch of items to insert
2022-10-27 15:21:42 +03:00
*
* Main purpose is to save stack depth by doing the bulk of the work in a
* function that doesn't call btrfs_search_slot
2007-02-02 19:05:29 +03:00
*/
2021-09-24 14:28:14 +03:00
static void setup_items_for_insert ( struct btrfs_root * root , struct btrfs_path * path ,
const struct btrfs_item_batch * batch )
2007-01-26 23:51:26 +03:00
{
2016-06-23 01:54:23 +03:00
struct btrfs_fs_info * fs_info = root - > fs_info ;
2008-01-29 23:15:18 +03:00
int i ;
2007-03-12 19:01:18 +03:00
u32 nritems ;
2007-01-26 23:51:26 +03:00
unsigned int data_end ;
2007-03-12 23:22:34 +03:00
struct btrfs_disk_key disk_key ;
2009-03-13 17:04:31 +03:00
struct extent_buffer * leaf ;
int slot ;
2012-03-03 16:40:03 +04:00
struct btrfs_map_token token ;
2020-09-01 17:39:59 +03:00
u32 total_size ;
2012-03-03 16:40:03 +04:00
2021-09-24 14:28:13 +03:00
/*
 * Before anything else, update keys in the parent and other ancestors
 * if needed, then release the write locks on them, so that other tasks
 * can use them while we modify the leaf.
 */
2014-07-28 22:34:35 +04:00
if ( path - > slots [ 0 ] = = 0 ) {
2021-09-24 14:28:13 +03:00
btrfs_cpu_key_to_disk ( & disk_key , & batch - > keys [ 0 ] ) ;
2018-06-20 15:48:47 +03:00
fixup_low_keys ( path , & disk_key , 1 ) ;
2014-07-28 22:34:35 +04:00
}
btrfs_unlock_up_safe ( path , 1 ) ;
2007-10-16 00:14:19 +04:00
leaf = path - > nodes [ 0 ] ;
2009-03-13 17:04:31 +03:00
slot = path - > slots [ 0 ] ;
2007-02-02 19:05:29 +03:00
2007-10-16 00:14:19 +04:00
nritems = btrfs_header_nritems ( leaf ) ;
2019-03-20 13:33:10 +03:00
data_end = leaf_data_end ( leaf ) ;
2021-09-24 14:28:13 +03:00
total_size = batch - > total_data_size + ( batch - > nr * sizeof ( struct btrfs_item ) ) ;
2007-02-02 17:18:22 +03:00
2019-03-20 16:36:46 +03:00
if ( btrfs_leaf_free_space ( leaf ) < total_size ) {
2017-06-29 19:37:49 +03:00
btrfs_print_leaf ( leaf ) ;
2016-06-23 01:54:23 +03:00
btrfs_crit(fs_info, "not enough freespace need %u have %d",
2019-03-20 16:36:46 +03:00
total_size , btrfs_leaf_free_space ( leaf ) ) ;
2007-01-26 23:51:26 +03:00
BUG ( ) ;
2007-04-04 22:08:15 +04:00
}
2007-10-16 00:14:19 +04:00
2019-08-09 18:48:21 +03:00
btrfs_init_map_token ( & token , leaf ) ;
2007-01-26 23:51:26 +03:00
if ( slot ! = nritems ) {
2021-10-21 21:58:37 +03:00
unsigned int old_data = btrfs_item_data_end ( leaf , slot ) ;
2007-01-26 23:51:26 +03:00
2007-10-16 00:14:19 +04:00
if ( old_data < data_end ) {
2017-06-29 19:37:49 +03:00
btrfs_print_leaf ( leaf ) ;
2020-09-01 17:40:01 +03:00
btrfs_crit ( fs_info ,
"item at slot %d with data offset %u beyond data end of leaf %u",
2016-09-20 17:05:00 +03:00
slot , old_data , data_end ) ;
2019-03-25 16:02:25 +03:00
BUG ( ) ;
2007-10-16 00:14:19 +04:00
}
2007-01-26 23:51:26 +03:00
/*
* item0..itemN ... dataN.offset..dataN.size .. data0.size
*/
/* first correct the data pointers */
2007-03-13 03:12:07 +03:00
for ( i = slot ; i < nritems ; i + + ) {
2007-10-16 00:14:19 +04:00
u32 ioff ;
2007-10-16 00:15:53 +04:00
2021-10-21 21:58:35 +03:00
ioff = btrfs_token_item_offset ( & token , i ) ;
btrfs_set_token_item_offset ( & token , i ,
2021-10-21 21:58:34 +03:00
ioff - batch - > total_data_size ) ;
2007-03-13 03:12:07 +03:00
}
2007-01-26 23:51:26 +03:00
/* shift the items */
2022-11-15 19:16:17 +03:00
memmove_leaf_items ( leaf , slot + batch - > nr , slot , nritems - slot ) ;
2007-01-26 23:51:26 +03:00
/* shift the data */
2022-11-15 19:16:17 +03:00
memmove_leaf_data ( leaf , data_end - batch - > total_data_size ,
data_end , old_data - data_end ) ;
2007-01-26 23:51:26 +03:00
data_end = old_data ;
}
2007-10-16 00:14:19 +04:00
2007-03-15 19:56:47 +03:00
/* setup the item for the new data */
2021-09-24 14:28:13 +03:00
for ( i = 0 ; i < batch - > nr ; i + + ) {
btrfs_cpu_key_to_disk ( & disk_key , & batch - > keys [ i ] ) ;
2008-01-29 23:15:18 +03:00
btrfs_set_item_key ( leaf , & disk_key , slot + i ) ;
2021-09-24 14:28:13 +03:00
data_end - = batch - > data_sizes [ i ] ;
2021-10-21 21:58:35 +03:00
btrfs_set_token_item_offset ( & token , slot + i , data_end ) ;
btrfs_set_token_item_size ( & token , slot + i , batch - > data_sizes [ i ] ) ;
2008-01-29 23:15:18 +03:00
}
2009-03-13 17:04:31 +03:00
2021-09-24 14:28:13 +03:00
btrfs_set_header_nritems ( leaf , nritems + batch - > nr ) ;
2009-03-13 18:00:37 +03:00
btrfs_mark_buffer_dirty ( leaf ) ;
2007-03-01 00:35:06 +03:00
2019-03-20 16:36:46 +03:00
if ( btrfs_leaf_free_space ( leaf ) < 0 ) {
2017-06-29 19:37:49 +03:00
btrfs_print_leaf ( leaf ) ;
2007-01-26 23:51:26 +03:00
BUG ( ) ;
2007-10-16 00:14:19 +04:00
}
2009-03-13 17:04:31 +03:00
}
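The insertion code above depends on two things: the caller fills in the item batch (sizes plus their total) in one pass, and item data is packed backwards from the end of the leaf while item headers grow from the front. The following is a minimal user-space sketch of both ideas, not kernel code. The names `toy_item`, `toy_leaf`, `toy_batch`, `toy_batch_fill`, `toy_leaf_data_end` and `toy_leaf_append` are hypothetical, invented for illustration, and the sketch assumes items are only appended at the end of the leaf and that headers and data live in separate fixed-size areas.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_LEAF_DATA_SIZE 256u	/* hypothetical size of the data area */
#define TOY_MAX_ITEMS 16u	/* hypothetical cap on item headers */

/* Hypothetical stand-in for an item header: where its data lives. */
struct toy_item {
	uint32_t offset;
	uint32_t size;
};

/* Hypothetical stand-in for a leaf: headers in front, data packed at the end. */
struct toy_leaf {
	uint32_t nritems;
	struct toy_item items[TOY_MAX_ITEMS];
};

/* Carries the per-batch details the commit message lists (keys omitted). */
struct toy_batch {
	const uint32_t *data_sizes;
	uint32_t total_data_size;
	uint32_t nr;
};

/*
 * Populate the sizes array and sum it in the same loop, so that no
 * later stage has to walk the array again to get the total.
 */
static void toy_batch_fill(struct toy_batch *batch, uint32_t *sizes,
			   const uint32_t *wanted, uint32_t nr)
{
	uint32_t i;

	batch->total_data_size = 0;
	for (i = 0; i < nr; i++) {
		sizes[i] = wanted[i];
		batch->total_data_size += wanted[i];
	}
	batch->data_sizes = sizes;
	batch->nr = nr;
}

/* Lowest data offset in use; the full area size when the leaf is empty. */
static uint32_t toy_leaf_data_end(const struct toy_leaf *leaf)
{
	assert(leaf->nritems <= TOY_MAX_ITEMS);
	if (leaf->nritems == 0)
		return TOY_LEAF_DATA_SIZE;
	return leaf->items[leaf->nritems - 1].offset;
}

/*
 * Append a batch after the last item. Each new item's data is placed
 * directly below the previous one, so offsets decrease as slots increase.
 * Returns 0 on success, -1 if the batch does not fit.
 */
static int toy_leaf_append(struct toy_leaf *leaf, const struct toy_batch *batch)
{
	uint32_t data_end = toy_leaf_data_end(leaf);
	uint32_t i;

	if (leaf->nritems + batch->nr > TOY_MAX_ITEMS ||
	    batch->total_data_size > data_end)
		return -1;

	for (i = 0; i < batch->nr; i++) {
		data_end -= batch->data_sizes[i];
		leaf->items[leaf->nritems + i].offset = data_end;
		leaf->items[leaf->nritems + i].size = batch->data_sizes[i];
	}
	leaf->nritems += batch->nr;
	return 0;
}
```

Because the total is already known when `toy_leaf_append()` runs, the fit check is a single comparison rather than another pass over the sizes array, which is the saving the commit message describes.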
2021-09-24 14:28:14 +03:00
/*
* Insert a new item into a leaf .
*
* @ root : The root of the btree .
* @ path : A path pointing to the target leaf and slot .
* @ key : The key of the new item .
* @ data_size : The size of the data associated with the new key .
*/
void btrfs_setup_item_for_insert ( struct btrfs_root * root ,
struct btrfs_path * path ,
const struct btrfs_key * key ,
u32 data_size )
{
struct btrfs_item_batch batch ;
batch . keys = key ;
batch . data_sizes = & data_size ;
batch . total_data_size = data_size ;
batch . nr = 1 ;
setup_items_for_insert ( root , path , & batch ) ;
}
2009-03-13 17:04:31 +03:00
/*
* Given a key and some data , insert items into the tree .
* This does all the path init required , making room in the tree if needed .
*/
int btrfs_insert_empty_items ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path ,
2021-09-24 14:28:13 +03:00
const struct btrfs_item_batch * batch )
2009-03-13 17:04:31 +03:00
{
int ret = 0 ;
int slot ;
2021-09-24 14:28:13 +03:00
u32 total_size ;
2009-03-13 17:04:31 +03:00
2021-09-24 14:28:13 +03:00
total_size = batch - > total_data_size + ( batch - > nr * sizeof ( struct btrfs_item ) ) ;
ret = btrfs_search_slot ( trans , root , & batch - > keys [ 0 ] , path , total_size , 1 ) ;
2009-03-13 17:04:31 +03:00
if ( ret = = 0 )
return - EEXIST ;
if ( ret < 0 )
2012-03-01 17:56:26 +04:00
return ret ;
2009-03-13 17:04:31 +03:00
slot = path - > slots [ 0 ] ;
BUG_ON ( slot < 0 ) ;
2021-09-24 14:28:13 +03:00
setup_items_for_insert ( root , path , batch ) ;
2012-03-01 17:56:26 +04:00
return 0 ;
2007-03-15 19:56:47 +03:00
}
/*
* Given a key and some data , insert an item into the tree .
* This does all the path init required , making room in the tree if needed .
*/
2017-01-18 10:24:37 +03:00
int btrfs_insert_item ( struct btrfs_trans_handle * trans , struct btrfs_root * root ,
const struct btrfs_key * cpu_key , void * data ,
u32 data_size )
2007-03-15 19:56:47 +03:00
{
int ret = 0 ;
2007-04-02 18:50:19 +04:00
struct btrfs_path * path ;
2007-10-16 00:14:19 +04:00
struct extent_buffer * leaf ;
unsigned long ptr ;
2007-03-15 19:56:47 +03:00
2007-04-02 18:50:19 +04:00
path = btrfs_alloc_path ( ) ;
2011-03-23 11:14:16 +03:00
if ( ! path )
return - ENOMEM ;
2007-04-02 18:50:19 +04:00
ret = btrfs_insert_empty_item ( trans , root , path , cpu_key , data_size ) ;
2007-03-15 19:56:47 +03:00
if ( ! ret ) {
2007-10-16 00:14:19 +04:00
leaf = path - > nodes [ 0 ] ;
ptr = btrfs_item_ptr_offset ( leaf , path - > slots [ 0 ] ) ;
write_extent_buffer ( leaf , data , ptr , data_size ) ;
btrfs_mark_buffer_dirty ( leaf ) ;
2007-03-15 19:56:47 +03:00
}
2007-04-02 18:50:19 +04:00
btrfs_free_path ( path ) ;
2007-03-01 00:35:06 +03:00
return ret ;
2007-01-26 23:51:26 +03:00
}
2021-09-24 14:28:14 +03:00
/*
* This function duplicates an item , giving ' new_key ' to the new item .
* It guarantees both items live in the same tree leaf and the new item is
* contiguous with the original item .
*
* This allows us to split a file extent in place , keeping a lock on the leaf
* the entire time .
*/
int btrfs_duplicate_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path ,
const struct btrfs_key * new_key )
{
struct extent_buffer * leaf ;
int ret ;
u32 item_size ;
leaf = path - > nodes [ 0 ] ;
2021-10-21 21:58:35 +03:00
item_size = btrfs_item_size ( leaf , path - > slots [ 0 ] ) ;
2021-09-24 14:28:14 +03:00
ret = setup_leaf_for_split ( trans , root , path ,
item_size + sizeof ( struct btrfs_item ) ) ;
if ( ret )
return ret ;
path - > slots [ 0 ] + + ;
btrfs_setup_item_for_insert ( root , path , new_key , item_size ) ;
leaf = path - > nodes [ 0 ] ;
memcpy_extent_buffer ( leaf ,
btrfs_item_ptr_offset ( leaf , path - > slots [ 0 ] ) ,
btrfs_item_ptr_offset ( leaf , path - > slots [ 0 ] - 1 ) ,
item_size ) ;
return 0 ;
}
2007-02-02 19:05:29 +03:00
/*
2007-02-24 14:24:44 +03:00
* delete the pointer from a given node .
2007-02-02 19:05:29 +03:00
*
2008-09-29 23:18:18 +04:00
* the tree should have been previously balanced so the deletion does not
* empty a node .
2023-04-29 23:07:21 +03:00
*
* This is exported for use inside btrfs - progs , don ' t un - export it .
2007-02-02 19:05:29 +03:00
*/
2023-06-08 13:27:49 +03:00
int btrfs_del_ptr ( struct btrfs_trans_handle * trans , struct btrfs_root * root ,
struct btrfs_path * path , int level , int slot )
2007-01-26 23:51:26 +03:00
{
2007-10-16 00:14:19 +04:00
struct extent_buffer * parent = path - > nodes [ level ] ;
2007-03-12 19:01:18 +03:00
u32 nritems ;
2012-05-26 13:45:21 +04:00
int ret ;
2007-01-26 23:51:26 +03:00
2007-10-16 00:14:19 +04:00
nritems = btrfs_header_nritems ( parent ) ;
2009-01-06 05:25:51 +03:00
if ( slot ! = nritems - 1 ) {
2018-03-05 17:47:39 +03:00
if ( level ) {
2021-03-11 17:31:07 +03:00
ret = btrfs_tree_mod_log_insert_move ( parent , slot ,
slot + 1 , nritems - slot - 1 ) ;
2023-06-08 13:27:49 +03:00
if ( ret < 0 ) {
btrfs_abort_transaction ( trans , ret ) ;
return ret ;
}
2018-03-05 17:47:39 +03:00
}
2007-10-16 00:14:19 +04:00
memmove_extent_buffer ( parent ,
2022-11-15 19:16:16 +03:00
btrfs_node_key_ptr_offset ( parent , slot ) ,
btrfs_node_key_ptr_offset ( parent , slot + 1 ) ,
2007-03-30 22:27:56 +04:00
sizeof ( struct btrfs_key_ptr ) *
( nritems - slot - 1 ) ) ;
2012-12-19 04:35:32 +04:00
} else if ( level ) {
2021-03-11 17:31:07 +03:00
ret = btrfs_tree_mod_log_insert_key ( parent , slot ,
2022-10-14 16:44:33 +03:00
BTRFS_MOD_LOG_KEY_REMOVE ) ;
2023-06-08 13:27:49 +03:00
if ( ret < 0 ) {
btrfs_abort_transaction ( trans , ret ) ;
return ret ;
}
2007-03-01 20:04:21 +03:00
}
2012-05-26 13:45:21 +04:00
2007-03-12 19:01:18 +03:00
nritems - - ;
2007-10-16 00:14:19 +04:00
btrfs_set_header_nritems ( parent , nritems ) ;
2007-03-12 19:01:18 +03:00
if ( nritems = = 0 & & parent = = root - > node ) {
2007-10-16 00:14:19 +04:00
BUG_ON ( btrfs_header_level ( root - > node ) ! = 1 ) ;
2007-03-01 20:04:21 +03:00
/* just turn the root into a leaf and break */
2007-10-16 00:14:19 +04:00
btrfs_set_header_level ( root - > node , 0 ) ;
2007-03-01 20:04:21 +03:00
} else if ( slot = = 0 ) {
2007-10-16 00:14:19 +04:00
struct btrfs_disk_key disk_key ;
btrfs_node_key ( parent , & disk_key , 0 ) ;
2018-06-20 15:48:47 +03:00
fixup_low_keys ( path , & disk_key , level + 1 ) ;
2007-01-26 23:51:26 +03:00
}
2007-03-30 22:27:56 +04:00
btrfs_mark_buffer_dirty ( parent ) ;
2023-06-08 13:27:49 +03:00
return 0 ;
2007-01-26 23:51:26 +03:00
}
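The core of the pointer deletion above is an array compaction: every key pointer after the deleted slot moves down by one, and the item count drops. A minimal user-space sketch of just that step follows. `toy_key_ptr`, `toy_node` and `toy_del_ptr` are hypothetical names, and the sketch deliberately leaves out the tree mod log calls, the low-key fixup for slot 0, and the root-to-leaf conversion that the real function performs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_PTRS 8u	/* hypothetical node capacity */

/* Hypothetical stand-in for a key pointer stored in an interior node. */
struct toy_key_ptr {
	uint64_t key;
	uint64_t blockptr;
};

struct toy_node {
	uint32_t nritems;
	struct toy_key_ptr ptrs[TOY_MAX_PTRS];
};

/*
 * Remove the pointer at 'slot'. If it is not the last one, shift the
 * (nritems - slot - 1) trailing entries down by one position; memmove
 * is used because source and destination overlap.
 */
static void toy_del_ptr(struct toy_node *node, uint32_t slot)
{
	assert(slot < node->nritems);

	if (slot != node->nritems - 1) {
		memmove(&node->ptrs[slot], &node->ptrs[slot + 1],
			sizeof(struct toy_key_ptr) *
			(node->nritems - slot - 1));
	}
	node->nritems--;
}
```

Deleting the last slot needs no data movement at all, which is why the original code only calls the move helper when `slot != nritems - 1`.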
2008-10-02 03:05:46 +04:00
/*
* a helper function to delete the leaf pointed to by path - > slots [ 1 ] and
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 18:45:14 +04:00
* path - > nodes [ 1 ] .
2008-10-02 03:05:46 +04:00
*
* This deletes the pointer in path - > nodes [ 1 ] and frees the leaf
* block extent . zero is returned if it all worked out , < 0 otherwise .
*
* The path must have already been setup for deleting the leaf , including
* all the proper balancing . path - > nodes [ 1 ] must be locked .
*/
2023-06-08 13:27:49 +03:00
static noinline int btrfs_del_leaf ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path ,
struct extent_buffer * leaf )
2008-10-02 03:05:46 +04:00
{
2023-06-08 13:27:49 +03:00
int ret ;
2009-06-10 18:45:14 +04:00
WARN_ON ( btrfs_header_generation ( leaf ) ! = trans - > transid ) ;
2023-06-08 13:27:49 +03:00
ret = btrfs_del_ptr ( trans , root , path , 1 , path - > slots [ 1 ] ) ;
if ( ret < 0 )
return ret ;
2008-10-02 03:05:46 +04:00
2009-02-04 17:31:28 +03:00
/*
* btrfs_free_extent is expensive , we want to make sure we
* aren ' t holding any locks when we call it
*/
btrfs_unlock_up_safe ( path , 0 ) ;
2023-09-08 02:09:33 +03:00
root_sub_used_bytes ( root ) ;
2010-05-16 18:46:25 +04:00
2019-10-08 14:28:47 +03:00
atomic_inc ( & leaf - > refs ) ;
2021-12-13 11:45:12 +03:00
btrfs_free_tree_block ( trans , btrfs_root_id ( root ) , leaf , 0 , 1 ) ;
2012-03-10 01:01:49 +04:00
free_extent_buffer_stale ( leaf ) ;
2023-06-08 13:27:49 +03:00
return 0 ;
2008-10-02 03:05:46 +04:00
}
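The "Mixed back reference" commit message quoted earlier in this section states two reference counting rules for copy-on-write of a tree block: a non-shared block is freed at once and the copy inherits its references, while a shared block has its own count decreased and the counts of everything it points to increased. The sketch below is one possible reading of those two rules in user-space C, not the kernel's implementation. `toy_block` and `toy_cow` are hypothetical names, and "inherits the references" is modeled here simply as leaving the children's counts unchanged.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CHILDREN 4	/* hypothetical fan-out */

/* Hypothetical stand-in for a tree block with a reference count. */
struct toy_block {
	int refs;
	int freed;
	int nr_children;
	struct toy_block *children[TOY_MAX_CHILDREN];
};

/*
 * Copy 'old' into 'copy' and apply the two rules from the commit message.
 * 'copy' always starts with a single reference held by the new tree root.
 */
static void toy_cow(struct toy_block *old, struct toy_block *copy)
{
	int i;

	assert(old->refs >= 1 && !old->freed);

	copy->refs = 1;
	copy->freed = 0;
	copy->nr_children = old->nr_children;
	for (i = 0; i < old->nr_children; i++)
		copy->children[i] = old->children[i];

	if (old->refs == 1) {
		/*
		 * Non-shared: free the old block immediately. The copy takes
		 * over its references, so child counts stay as they are.
		 */
		old->refs = 0;
		old->freed = 1;
	} else {
		/*
		 * Shared: old and copy now both point at the children, so
		 * each child gains a reference and old loses one.
		 */
		for (i = 0; i < copy->nr_children; i++)
			copy->children[i]->refs++;
		old->refs--;
	}
}
```

In the non-shared case no child count is touched, which matches the commit message's point that lower-level extents no longer need their reference counts modified when a non-shared block is cow'd.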
2007-02-02 19:05:29 +03:00
/*
* delete the item at the leaf level in path . If that empties
* the leaf , remove it from the tree
*/
2008-01-29 23:11:36 +03:00
int btrfs_del_items ( struct btrfs_trans_handle * trans , struct btrfs_root * root ,
struct btrfs_path * path , int slot , int nr )
2007-01-26 23:51:26 +03:00
{
2016-06-23 01:54:23 +03:00
struct btrfs_fs_info * fs_info = root - > fs_info ;
2007-10-16 00:14:19 +04:00
struct extent_buffer * leaf ;
2007-03-01 00:35:06 +03:00
int ret = 0 ;
int wret ;
2007-03-12 19:01:18 +03:00
u32 nritems ;
2007-01-26 23:51:26 +03:00
2007-10-16 00:14:19 +04:00
leaf = path - > nodes [ 0 ] ;
nritems = btrfs_header_nritems ( leaf ) ;
2007-01-26 23:51:26 +03:00
2008-01-29 23:11:36 +03:00
if ( slot + nr ! = nritems ) {
2022-02-03 17:55:47 +03:00
const u32 last_off = btrfs_item_offset ( leaf , slot + nr - 1 ) ;
const int data_end = leaf_data_end ( leaf ) ;
2019-08-09 18:48:21 +03:00
struct btrfs_map_token token ;
2022-02-03 17:55:47 +03:00
u32 dsize = 0 ;
int i ;
for ( i = 0 ; i < nr ; i + + )
dsize + = btrfs_item_size ( leaf , slot + i ) ;
2007-10-16 00:14:19 +04:00
2022-11-15 19:16:17 +03:00
memmove_leaf_data ( leaf , data_end + dsize , data_end ,
last_off - data_end ) ;
2007-10-16 00:14:19 +04:00
2019-08-09 18:48:21 +03:00
btrfs_init_map_token ( & token , leaf ) ;
2008-01-29 23:11:36 +03:00
for ( i = slot + nr ; i < nritems ; i + + ) {
2007-10-16 00:14:19 +04:00
u32 ioff ;
2007-10-16 00:15:53 +04:00
2021-10-21 21:58:35 +03:00
ioff = btrfs_token_item_offset ( & token , i ) ;
btrfs_set_token_item_offset ( & token , i , ioff + dsize ) ;
2007-03-13 03:12:07 +03:00
}
2007-10-16 00:15:53 +04:00
2022-11-15 19:16:17 +03:00
memmove_leaf_items ( leaf , slot , slot + nr , nritems - slot - nr ) ;
2007-01-26 23:51:26 +03:00
}
2008-01-29 23:11:36 +03:00
btrfs_set_header_nritems ( leaf , nritems - nr ) ;
nritems - = nr ;
2007-10-16 00:14:19 +04:00
2007-02-02 19:05:29 +03:00
/* delete the leaf if we've emptied it */
2007-03-12 19:01:18 +03:00
if ( nritems = = 0 ) {
2007-10-16 00:14:19 +04:00
if ( leaf = = root - > node ) {
btrfs_set_header_level ( leaf , 0 ) ;
2007-02-23 16:38:36 +03:00
} else {
2023-01-27 00:00:58 +03:00
btrfs_clear_buffer_dirty ( trans , leaf ) ;
2023-06-08 13:27:49 +03:00
ret = btrfs_del_leaf ( trans , root , path , leaf ) ;
if ( ret < 0 )
return ret ;
2007-02-23 16:38:36 +03:00
}
2007-01-26 23:51:26 +03:00
} else {
2007-03-12 19:01:18 +03:00
int used = leaf_space_used ( leaf , 0 , nritems ) ;
2007-03-01 00:35:06 +03:00
if ( slot = = 0 ) {
2007-10-16 00:14:19 +04:00
struct btrfs_disk_key disk_key ;
btrfs_item_key ( leaf , & disk_key , 0 ) ;
2018-06-20 15:48:47 +03:00
fixup_low_keys ( path , & disk_key , 1 ) ;
2007-03-01 00:35:06 +03:00
}
btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
When we delete items from a leaf, if we end up with more than two thirds
of unused leaf space, we try to delete the leaf by moving all its items
into its left and right neighbour leaves. Sometimes that is not possible
because there is not enough free space in the left and right leaves, and
in that case we end up not deleting our leaf.
The way we are doing this is not ideal and can be improved in the
following ways:
1) When we call push_leaf_left(), we pass a value of 1 byte to the data
size parameter of push_leaf_left(). This is not a realistic value because
no item can have a size less than 25 bytes, which is the size of struct
btrfs_item. This means that if the left leaf does not have enough
free space to push any item, we end up COWing it even if we end up not
changing its content at all.
COWing that leaf means allocating a new metadata extent, marking it
dirty and doing more IO when committing a transaction or when syncing a
log tree. For the log tree case, it's particularly important to
avoid the useless COW operation, as more IO can imply a higher latency
for an fsync operation.
So instead of passing 1 as the minimum data size for push_leaf_left(),
pass the size of the first item in our leaf, as we don't want to COW
the left leaf if we can't at least push the first item of our leaf;
2) When we call push_leaf_right(), we also pass a value of 1 byte as the
data size parameter of push_leaf_right(). Like the previous case, it
will also result in COWing the right leaf even if we are not able to
move any items into it, since there can't be any item with a size
smaller than 25 bytes (the size of struct btrfs_item).
So instead of passing 1 as the minimum data size to push_leaf_right(),
pass a size that corresponds to the sum of the size of all the
remaining items in our leaf. We are not interested in moving less than
that, because if we do, we are not able to delete our leaf and we have
COWed the right leaf for nothing. Plus, moving only some of the items
of our leaf means an even less balanced tree.
Just like the previous case, we want to avoid the useless COW of the
right leaf, so that we don't have to spend time allocating one new
metadata extent and doing more IO when committing a transaction or
syncing a log tree. For the log tree case it's especially important
because more IO can result in a higher latency for an fsync operation.
So adjust the minimum data size passed to push_leaf_left() and
push_leaf_right() as mentioned above.
This change is part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
Not being able to delete a leaf that became less than 1/3 full after
deleting items from it is actually common. For example, for the fio test
mentioned in the changelog of patch 6/6, we are only able to delete a
leaf at btrfs_del_items() about 5.3% of the time, due to its left and
right neighbour leaves not having enough free space to push all the
remaining items into them.
The last patch in the series has some performance test results in its
changelog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
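The two thresholds described in this changelog can be illustrated with a small userspace sketch. It is not the kernel code: struct toy_leaf, min_push_left(), min_push_right() and worth_cow() are hypothetical names, and the only figure taken from the changelog is the 25-byte size of struct btrfs_item.

```c
#include <assert.h>
#include <stdint.h>

/* Size of struct btrfs_item as stated in the changelog above. */
#define ITEM_HEADER_SIZE 25u

/* Hypothetical toy leaf: only the data sizes of its items matter here. */
struct toy_leaf {
	uint32_t nritems;
	uint32_t item_size[8];
};

/* Left push: we want room for at least the first item (header + data). */
static uint32_t min_push_left(const struct toy_leaf *leaf)
{
	assert(leaf->nritems > 0);
	return ITEM_HEADER_SIZE + leaf->item_size[0];
}

/* Right push: all or nothing, so we need room for every remaining item. */
static uint32_t min_push_right(const struct toy_leaf *leaf)
{
	uint32_t total = 0;

	for (uint32_t i = 0; i < leaf->nritems; i++)
		total += ITEM_HEADER_SIZE + leaf->item_size[i];
	return total;
}

/*
 * Assumed decision rule for this sketch: touching (and therefore COWing)
 * a neighbour only pays off if its free space covers the minimum.
 */
static int worth_cow(uint32_t neighbour_free, uint32_t min_space)
{
	return neighbour_free >= min_space;
}
```

For a leaf holding items of 100, 40 and 60 data bytes, the left threshold is 125 bytes and the right threshold is 275 bytes, whereas the old threshold of 1 byte would have allowed a COW of a neighbour with almost no free space.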
2022-02-03 17:55:46 +03:00
/*
* Try to delete the leaf if it is mostly empty . We do this by
* trying to move all its items into its left and right neighbours .
* If we can ' t move all the items , then we don ' t delete it - it ' s
* not ideal , but future insertions might fill the leaf with more
* items , or items from other leaves might be moved later into our
* leaf due to deletions on those leaves .
*/
2016-06-23 01:54:23 +03:00
if ( used < BTRFS_LEAF_DATA_SIZE ( fs_info ) / 3 ) {
2022-02-03 17:55:46 +03:00
u32 min_push_space ;
2007-01-26 23:51:26 +03:00
/* push_leaf_left fixes the path.
* make sure the path still points to our leaf
2023-04-29 23:07:21 +03:00
* for possible call to btrfs_del_ptr below
2007-01-26 23:51:26 +03:00
*/
2007-01-27 00:38:42 +03:00
slot = path - > slots [ 1 ] ;
2019-10-08 14:28:47 +03:00
atomic_inc ( & leaf - > refs ) ;
2022-02-03 17:55:46 +03:00
/*
* We want to be able to at least push one item to the
* left neighbour leaf , and that ' s the first item .
*/
min_push_space = sizeof ( struct btrfs_item ) +
btrfs_item_size ( leaf , 0 ) ;
wret = push_leaf_left ( trans , root , path , 0 ,
min_push_space , 1 , ( u32 ) - 1 ) ;
2007-06-22 22:16:25 +04:00
if ( wret < 0 & & wret ! = - ENOSPC )
2007-03-01 00:35:06 +03:00
ret = wret ;
2007-10-16 00:14:19 +04:00
if ( path - > nodes [ 0 ] = = leaf & &
btrfs_header_nritems ( leaf ) ) {
2022-02-03 17:55:46 +03:00
/*
* If we were not able to push all items from our
* leaf to its left neighbour , then attempt to
* either push all the remaining items to the
* right neighbour or none . There ' s no advantage
* in pushing only some items , instead of all , as
* it ' s pointless to end up with a leaf having
* too few items while the neighbours can be full
* or nearly full .
*/
nritems = btrfs_header_nritems ( leaf ) ;
min_push_space = leaf_space_used ( leaf , 0 , nritems ) ;
wret = push_leaf_right ( trans , root , path , 0 ,
min_push_space , 1 , 0 ) ;
2007-06-22 22:16:25 +04:00
if ( wret < 0 & & wret ! = - ENOSPC )
2007-03-01 00:35:06 +03:00
ret = wret ;
}
2007-10-16 00:14:19 +04:00
if ( btrfs_header_nritems ( leaf ) = = 0 ) {
2008-10-02 03:05:46 +04:00
path - > slots [ 1 ] = slot ;
2023-06-08 13:27:49 +03:00
ret = btrfs_del_leaf ( trans , root , path , leaf ) ;
if ( ret < 0 )
return ret ;
2007-10-16 00:14:19 +04:00
free_extent_buffer ( leaf ) ;
2012-03-01 17:56:26 +04:00
ret = 0 ;
2007-02-24 14:24:44 +03:00
} else {
2008-06-26 00:01:30 +04:00
/* if we're still in the path, make sure
* we ' re dirty . Otherwise , one of the
* push_leaf functions must have already
* dirtied this buffer
*/
if ( path - > nodes [ 0 ] = = leaf )
btrfs_mark_buffer_dirty ( leaf ) ;
2007-10-16 00:14:19 +04:00
free_extent_buffer ( leaf ) ;
2007-01-26 23:51:26 +03:00
}
2007-03-23 17:01:08 +03:00
} else {
2007-10-16 00:14:19 +04:00
btrfs_mark_buffer_dirty ( leaf ) ;
2007-01-26 23:51:26 +03:00
}
}
2007-03-01 00:35:06 +03:00
return ret ;
2007-01-26 23:51:26 +03:00
}
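The data compaction done at the top of btrfs_del_items() can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the kernel implementation: struct toy_leaf, toy_append() and toy_del_items() are hypothetical, item headers live in a plain array, and item data is packed downwards from the end of a small byte buffer so that the last item has the lowest offset.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_DATA_SIZE 64u
#define TOY_MAX_ITEMS 8u

struct toy_item {
	uint32_t offset;	/* start of this item's data in data[] */
	uint32_t size;		/* length of this item's data */
};

struct toy_leaf {
	uint32_t nritems;
	struct toy_item items[TOY_MAX_ITEMS];
	uint8_t data[TOY_DATA_SIZE];
};

/* Append an item; its data goes just below the previous item's data. */
static void toy_append(struct toy_leaf *leaf, const void *buf, uint32_t size)
{
	uint32_t end = leaf->nritems ?
		leaf->items[leaf->nritems - 1].offset : TOY_DATA_SIZE;

	assert(leaf->nritems < TOY_MAX_ITEMS && size <= end);
	leaf->items[leaf->nritems].offset = end - size;
	leaf->items[leaf->nritems].size = size;
	memcpy(leaf->data + end - size, buf, size);
	leaf->nritems++;
}

/* Delete nr items starting at slot, closing the gap in data and headers. */
static void toy_del_items(struct toy_leaf *leaf, uint32_t slot, uint32_t nr)
{
	uint32_t nritems = leaf->nritems;
	uint32_t i;

	assert(nr > 0 && slot + nr <= nritems);
	if (slot + nr != nritems) {
		uint32_t last_off = leaf->items[slot + nr - 1].offset;
		uint32_t data_end = leaf->items[nritems - 1].offset;
		uint32_t dsize = 0;

		for (i = 0; i < nr; i++)
			dsize += leaf->items[slot + i].size;

		/* Slide the data of the surviving tail items up by dsize. */
		memmove(leaf->data + data_end + dsize, leaf->data + data_end,
			last_off - data_end);

		/* Their recorded offsets grow by the same amount. */
		for (i = slot + nr; i < nritems; i++)
			leaf->items[i].offset += dsize;

		/* Finally close the gap in the header array. */
		memmove(&leaf->items[slot], &leaf->items[slot + nr],
			(nritems - slot - nr) * sizeof(struct toy_item));
	}
	leaf->nritems = nritems - nr;
}
```

Deleting the tail of the leaf needs no data movement at all, which is why the kernel function guards the whole block with the slot + nr != nritems test.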
2008-06-26 00:01:31 +04:00
/*
* A helper function to walk down the tree starting at min_key , and looking
2013-01-31 22:21:12 +04:00
* for nodes or leaves that have a minimum transaction id .
* This is used by the btree defrag code , and tree logging
2008-06-26 00:01:31 +04:00
*
* This does not cow , but it does stuff the starting key it finds back
* into min_key , so you can call btrfs_search_slot with cow = 1 on the
* key and get a writable path .
*
* This honors path - > lowest_level to prevent descent past a given level
* of the tree .
*
2008-09-29 23:18:18 +04:00
* min_trans indicates the oldest transaction that you are interested
* in walking through . Any nodes or leaves older than min_trans are
* skipped over ( without reading them ) .
*
2008-06-26 00:01:31 +04:00
* returns zero if something useful was found , < 0 on error and 1 if there
* was nothing in the tree that matched the search criteria .
*/
int btrfs_search_forward ( struct btrfs_root * root , struct btrfs_key * min_key ,
2013-01-31 22:21:12 +04:00
struct btrfs_path * path ,
2008-06-26 00:01:31 +04:00
u64 min_trans )
{
struct extent_buffer * cur ;
struct btrfs_key found_key ;
int slot ;
2008-07-24 20:19:49 +04:00
int sret ;
2008-06-26 00:01:31 +04:00
u32 nritems ;
int level ;
int ret = 1 ;
2014-08-04 22:37:21 +04:00
int keep_locks = path - > keep_locks ;
2008-06-26 00:01:31 +04:00
2022-09-12 22:27:51 +03:00
ASSERT ( ! path - > nowait ) ;
2014-08-04 22:37:21 +04:00
path - > keep_locks = 1 ;
2008-06-26 00:01:31 +04:00
again :
2011-07-16 23:23:14 +04:00
cur = btrfs_read_lock_root_node ( root ) ;
2008-06-26 00:01:31 +04:00
level = btrfs_header_level ( cur ) ;
2008-09-06 00:13:11 +04:00
WARN_ON ( path - > nodes [ level ] ) ;
2008-06-26 00:01:31 +04:00
path - > nodes [ level ] = cur ;
2011-07-16 23:23:14 +04:00
path - > locks [ level ] = BTRFS_READ_LOCK ;
2008-06-26 00:01:31 +04:00
if ( btrfs_header_generation ( cur ) < min_trans ) {
ret = 1 ;
goto out ;
}
2009-01-06 05:25:51 +03:00
while ( 1 ) {
2008-06-26 00:01:31 +04:00
nritems = btrfs_header_nritems ( cur ) ;
level = btrfs_header_level ( cur ) ;
2023-02-24 06:31:26 +03:00
sret = btrfs_bin_search ( cur , 0 , min_key , & slot ) ;
2019-02-18 19:57:26 +03:00
if ( sret < 0 ) {
ret = sret ;
goto out ;
}
2008-06-26 00:01:31 +04:00
2008-10-02 03:05:46 +04:00
/* at the lowest level, we're done, setup the path and exit */
if ( level = = path - > lowest_level ) {
2008-09-06 00:13:11 +04:00
if ( slot > = nritems )
goto find_next_key ;
2008-06-26 00:01:31 +04:00
ret = 0 ;
path - > slots [ level ] = slot ;
btrfs_item_key_to_cpu ( cur , & found_key , slot ) ;
goto out ;
}
2008-07-24 20:19:49 +04:00
if ( sret & & slot > 0 )
slot - - ;
2008-06-26 00:01:31 +04:00
/*
2013-01-31 22:21:12 +04:00
* check this node pointer against the min_trans parameters .
2020-08-05 05:48:34 +03:00
* If it is too old , skip to the next one .
2008-06-26 00:01:31 +04:00
*/
2009-01-06 05:25:51 +03:00
while ( slot < nritems ) {
2008-06-26 00:01:31 +04:00
u64 gen ;
2008-09-06 00:13:11 +04:00
2008-06-26 00:01:31 +04:00
gen = btrfs_node_ptr_generation ( cur , slot ) ;
if ( gen < min_trans ) {
slot + + ;
continue ;
}
2013-01-31 22:21:12 +04:00
break ;
2008-06-26 00:01:31 +04:00
}
2008-09-06 00:13:11 +04:00
find_next_key :
2008-06-26 00:01:31 +04:00
/*
* we didn ' t find a candidate key in this node , walk forward
* and find another one
*/
if ( slot > = nritems ) {
2008-09-06 00:13:11 +04:00
path - > slots [ level ] = slot ;
sret = btrfs_find_next_key ( root , path , min_key , level ,
2013-01-31 22:21:12 +04:00
min_trans ) ;
2008-09-06 00:13:11 +04:00
if ( sret = = 0 ) {
2011-04-21 03:20:15 +04:00
btrfs_release_path ( path ) ;
2008-06-26 00:01:31 +04:00
goto again ;
} else {
goto out ;
}
}
/* save our key for returning back */
btrfs_node_key_to_cpu ( cur , & found_key , slot ) ;
path - > slots [ level ] = slot ;
if ( level = = path - > lowest_level ) {
ret = 0 ;
goto out ;
}
2019-08-21 20:16:27 +03:00
cur = btrfs_read_node_slot ( cur , slot ) ;
2016-07-05 22:10:14 +03:00
if ( IS_ERR ( cur ) ) {
ret = PTR_ERR ( cur ) ;
goto out ;
}
2008-06-26 00:01:31 +04:00
2011-07-16 23:23:14 +04:00
btrfs_tree_read_lock ( cur ) ;
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop.
Most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able to get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 17:25:08 +03:00
2011-07-16 23:23:14 +04:00
path - > locks [ level - 1 ] = BTRFS_READ_LOCK ;
2008-06-26 00:01:31 +04:00
path - > nodes [ level - 1 ] = cur ;
2012-03-19 23:54:38 +04:00
unlock_up ( path , level , 1 , 0 , NULL ) ;
2008-06-26 00:01:31 +04:00
}
out :
2014-08-04 22:37:21 +04:00
path - > keep_locks = keep_locks ;
if ( ret = = 0 ) {
btrfs_unlock_up_safe ( path , path - > lowest_level + 1 ) ;
2008-06-26 00:01:31 +04:00
memcpy ( min_key , & found_key , sizeof ( found_key ) ) ;
2014-08-04 22:37:21 +04:00
}
2008-06-26 00:01:31 +04:00
return ret ;
}
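The pruning idea behind btrfs_search_forward() can be shown on a toy in-memory tree. The sketch below is recursive for brevity, whereas the kernel function is iterative, takes locks and restarts through btrfs_find_next_key(); struct toy_node and toy_search_forward() are hypothetical names, and the check of the root's own generation is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory node; level 0 is a leaf. */
struct toy_node {
	int level;
	int nritems;
	uint64_t keys[4];			/* first key of each child, or item keys */
	uint64_t ptr_gen[4];			/* generation of each child pointer */
	const struct toy_node *child[4];	/* unused in leaves */
};

/*
 * Find the smallest key >= min_key that is reachable only through child
 * pointers with generation >= min_trans. Returns 0 and fills *found on
 * success, 1 if nothing matched (mirroring the 0 / 1 return convention
 * documented above, without the error case).
 */
static int toy_search_forward(const struct toy_node *node, uint64_t min_key,
			      uint64_t min_trans, uint64_t *found)
{
	int slot = 0;

	if (node->level == 0) {
		for (slot = 0; slot < node->nritems; slot++) {
			if (node->keys[slot] >= min_key) {
				*found = node->keys[slot];
				return 0;
			}
		}
		return 1;
	}

	/* Start at the last child whose first key is <= min_key. */
	while (slot + 1 < node->nritems && node->keys[slot + 1] <= min_key)
		slot++;

	for (; slot < node->nritems; slot++) {
		/* Subtree is too old: skip it without reading it. */
		if (node->ptr_gen[slot] < min_trans)
			continue;
		if (toy_search_forward(node->child[slot], min_key,
				       min_trans, found) == 0)
			return 0;
	}
	return 1;
}
```

The key point is that an old child pointer lets the search skip a whole subtree, which is what makes this walk cheap for the defrag and tree logging callers mentioned in the comment.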
/*
* this is similar to btrfs_next_leaf , but does not try to preserve
* and fixup the path . It looks for and returns the next key in the
2013-01-31 22:21:12 +04:00
* tree based on the current path and the min_trans parameters .
2008-06-26 00:01:31 +04:00
*
* 0 is returned if another key is found , < 0 if there are any errors
* and 1 is returned if there are no higher keys in the tree
*
* path - > keep_locks should be set to 1 on the search made before
* calling this function .
*/
2008-06-26 00:01:31 +04:00
int btrfs_find_next_key ( struct btrfs_root * root , struct btrfs_path * path ,
2013-01-31 22:21:12 +04:00
struct btrfs_key * key , int level , u64 min_trans )
2008-06-26 00:01:31 +04:00
{
int slot ;
struct extent_buffer * c ;
2019-06-20 22:37:52 +03:00
WARN_ON ( ! path - > keep_locks & & ! path - > skip_locking ) ;
2009-01-06 05:25:51 +03:00
while ( level < BTRFS_MAX_LEVEL ) {
2008-06-26 00:01:31 +04:00
if ( ! path - > nodes [ level ] )
return 1 ;
slot = path - > slots [ level ] + 1 ;
c = path - > nodes [ level ] ;
2008-06-26 00:01:31 +04:00
next :
2008-06-26 00:01:31 +04:00
if ( slot > = btrfs_header_nritems ( c ) ) {
2009-07-22 17:59:00 +04:00
int ret ;
int orig_lowest ;
struct btrfs_key cur_key ;
if ( level + 1 > = BTRFS_MAX_LEVEL | |
! path - > nodes [ level + 1 ] )
2008-06-26 00:01:31 +04:00
return 1 ;
2009-07-22 17:59:00 +04:00
2019-06-20 22:37:52 +03:00
if ( path - > locks [ level + 1 ] | | path - > skip_locking ) {
2009-07-22 17:59:00 +04:00
level + + ;
continue ;
}
slot = btrfs_header_nritems ( c ) - 1 ;
if ( level = = 0 )
btrfs_item_key_to_cpu ( c , & cur_key , slot ) ;
else
btrfs_node_key_to_cpu ( c , & cur_key , slot ) ;
orig_lowest = path - > lowest_level ;
2011-04-21 03:20:15 +04:00
btrfs_release_path ( path ) ;
2009-07-22 17:59:00 +04:00
path - > lowest_level = level ;
ret = btrfs_search_slot ( NULL , root , & cur_key , path ,
0 , 0 ) ;
path - > lowest_level = orig_lowest ;
if ( ret < 0 )
return ret ;
c = path - > nodes [ level ] ;
slot = path - > slots [ level ] ;
if ( ret = = 0 )
slot + + ;
goto next ;
2008-06-26 00:01:31 +04:00
}
2009-07-22 17:59:00 +04:00
2008-06-26 00:01:31 +04:00
if ( level = = 0 )
btrfs_item_key_to_cpu ( c , key , slot ) ;
2008-06-26 00:01:31 +04:00
else {
u64 gen = btrfs_node_ptr_generation ( c , slot ) ;
if ( gen < min_trans ) {
slot + + ;
goto next ;
}
2008-06-26 00:01:31 +04:00
btrfs_node_key_to_cpu ( c , key , slot ) ;
2008-06-26 00:01:31 +04:00
}
2008-06-26 00:01:31 +04:00
return 0 ;
}
return 1 ;
}
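The upward walk in btrfs_find_next_key() can be sketched for the simple case where every upper level is still present in the path. The re-search that the kernel performs when an upper level is no longer locked is deliberately left out, and struct toy_node, struct toy_path and toy_find_next_key() are hypothetical names for this illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_LEVEL 4

struct toy_node {
	int nritems;
	uint64_t keys[4];
	uint64_t ptr_gen[4];	/* only meaningful above level 0 */
};

/* Hypothetical path: one node and one current slot per level. */
struct toy_path {
	const struct toy_node *nodes[TOY_MAX_LEVEL];
	int slots[TOY_MAX_LEVEL];
};

/*
 * Starting at the given level, look for the next slot after the current
 * one. If the node is exhausted, move up one level and try again. Above
 * level 0, pointers older than min_trans are skipped. Returns 0 and
 * fills *key when a key is found, 1 when there are no higher keys.
 */
static int toy_find_next_key(const struct toy_path *path, int level,
			     uint64_t min_trans, uint64_t *key)
{
	for (; level < TOY_MAX_LEVEL; level++) {
		const struct toy_node *c = path->nodes[level];

		if (c == NULL)
			return 1;
		for (int slot = path->slots[level] + 1; slot < c->nritems; slot++) {
			if (level > 0 && c->ptr_gen[slot] < min_trans)
				continue;
			*key = c->keys[slot];
			return 0;
		}
	}
	return 1;
}
```

Because the path is only read, the caller can feed the returned key back into a fresh search, which is how btrfs_search_forward() uses the real function.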
2012-06-11 10:29:29 +04:00
int btrfs_next_old_leaf ( struct btrfs_root * root , struct btrfs_path * path ,
u64 time_seq )
2007-02-21 00:40:44 +03:00
{
int slot ;
2009-04-03 18:14:18 +04:00
int level ;
2007-10-16 00:14:19 +04:00
struct extent_buffer * c ;
2009-04-03 18:14:18 +04:00
struct extent_buffer * next ;
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (especially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystems we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved in the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove or update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations so that send does fewer path
releases and less re-searching of the trees when relocation is running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 15:03:38 +03:00
	struct btrfs_fs_info *fs_info = root->fs_info;
2008-06-26 00:01:30 +04:00
	struct btrfs_key key;
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (especially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long-running operations: relocation
because it has to do a lot of IO and expensive backreference lookups when
there are many snapshots, and send because of read IO when operating on
very large trees. This makes it inconvenient for users and tools to
schedule both operations.
For zoned filesystems we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it, or send can end up
delaying the block group relocation for too long. In the future we might
also get automatic block group relocation for non-zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove or update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
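The restart described in step 4 can be sketched in userspace as follows. This is a minimal illustration under stated assumptions, not the kernel implementation: the tree is modelled as a sorted, read-only array of integer keys, and every name here (toy_tree, reloc_gen, toy_search_after, toy_send_walk, toy_bump_on_key) is hypothetical. The point it shows is that, because keys in a read-only tree never change, the walker can always re-search for the first key after the last one it processed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a read-only tree: a sorted array of keys. */
struct toy_tree {
	const int *keys;          /* sorted, never modified (RO root) */
	size_t nr;                /* number of keys */
	unsigned long reloc_gen;  /* bumped when a relocation transaction commits */
};

/* Return the index of the first key strictly greater than 'key'. */
static size_t toy_search_after(const struct toy_tree *t, int key)
{
	size_t lo = 0, hi = t->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (t->keys[mid] <= key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

typedef void (*toy_visit_fn)(struct toy_tree *t, int key, void *ctx);

/*
 * Visit every key once. After each key, check whether a relocation
 * transaction committed in the meantime; if so, drop the cached position
 * and search again from the last processed key instead of trusting it.
 */
static size_t toy_send_walk(struct toy_tree *t, toy_visit_fn visit,
			    void *ctx, size_t *restarts)
{
	unsigned long seen_gen = t->reloc_gen;
	size_t slot = 0, visited = 0;

	*restarts = 0;
	while (slot < t->nr) {
		int key = t->keys[slot];

		visit(t, key, ctx);
		visited++;

		if (t->reloc_gen != seen_gen) {
			seen_gen = t->reloc_gen;
			slot = toy_search_after(t, key);
			(*restarts)++;
		} else {
			slot++;
		}
	}
	return visited;
}

/* Example callback: simulate a relocation commit while visiting one key. */
static void toy_bump_on_key(struct toy_tree *t, int key, void *ctx)
{
	if (key == *(const int *)ctx)
		t->reloc_gen++;
}
```

With keys {1, 3, 5, 7} and a simulated relocation commit while key 3 is being processed, the walk restarts once and still visits each key exactly once.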
This leaves some room for optimizations so that send does fewer path
releases and less re-searching of the trees when relocation is running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 15:03:38 +03:00
	bool need_commit_sem = false;
2008-06-26 00:01:30 +04:00
	u32 nritems;
	int ret;
btrfs: unlock to current level in btrfs_next_old_leaf
Filipe reported the following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
5.10.0-rc2-btrfs-next-71 #1 Not tainted
------------------------------------------------------
find/324157 is trying to acquire lock:
ffff8ebc48d293a0 (btrfs-tree-01#2/3){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
but task is already holding lock:
ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (btrfs-tree-00){++++}-{3:3}:
lock_acquire+0xd8/0x490
down_write_nested+0x44/0x120
__btrfs_tree_lock+0x27/0x120 [btrfs]
btrfs_search_slot+0x2a3/0xc50 [btrfs]
btrfs_insert_empty_items+0x58/0xa0 [btrfs]
insert_with_overflow+0x44/0x110 [btrfs]
btrfs_insert_xattr_item+0xb8/0x1d0 [btrfs]
btrfs_setxattr+0xd6/0x4c0 [btrfs]
btrfs_setxattr_trans+0x68/0x100 [btrfs]
__vfs_setxattr+0x66/0x80
__vfs_setxattr_noperm+0x70/0x200
vfs_setxattr+0x6b/0x120
setxattr+0x125/0x240
path_setxattr+0xba/0xd0
__x64_sys_setxattr+0x27/0x30
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (btrfs-tree-01#2/3){++++}-{3:3}:
check_prev_add+0x91/0xc60
__lock_acquire+0x1689/0x3130
lock_acquire+0xd8/0x490
down_read_nested+0x45/0x220
__btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
btrfs_next_old_leaf+0x27d/0x580 [btrfs]
btrfs_real_readdir+0x1e3/0x4b0 [btrfs]
iterate_dir+0x170/0x1c0
__x64_sys_getdents64+0x83/0x140
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-tree-00);
lock(btrfs-tree-01#2/3);
lock(btrfs-tree-00);
lock(btrfs-tree-01#2/3);
*** DEADLOCK ***
5 locks held by find/324157:
#0: ffff8ebc502c6e00 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
#1: ffff8eb97f689980 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: iterate_dir+0x52/0x1c0
#2: ffff8ebaec00ca58 (btrfs-tree-02#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
#3: ffff8eb98f986f78 (btrfs-tree-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
#4: ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
stack backtrace:
CPU: 2 PID: 324157 Comm: find Not tainted 5.10.0-rc2-btrfs-next-71 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
check_noncircular+0xff/0x110
? mark_lock.part.0+0x468/0xe90
check_prev_add+0x91/0xc60
__lock_acquire+0x1689/0x3130
? kvm_clock_read+0x14/0x30
? kvm_sched_clock_read+0x5/0x10
lock_acquire+0xd8/0x490
? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
down_read_nested+0x45/0x220
? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
__btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
btrfs_next_old_leaf+0x27d/0x580 [btrfs]
btrfs_real_readdir+0x1e3/0x4b0 [btrfs]
iterate_dir+0x170/0x1c0
__x64_sys_getdents64+0x83/0x140
? filldir+0x1d0/0x1d0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because btrfs_next_old_leaf searches down to our current
key, and then walks up the path until we can move to the next slot, and
then reads back down the path so we get the next leaf.
However it doesn't unlock any lower levels until it replaces them with
the new extent buffer. This is technically fine, but of course causes
lockdep to complain, because we could be holding locks on lower levels
while locking upper levels.
Fix this by dropping all nodes below the level that we use as our new
starting point before we start reading back down the path. This also
allows us to drop the nested/recursive locking magic, because we're no
longer locking two nodes at the same level anymore.
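The fix described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: toy_node, toy_path, toy_path_lock and toy_drop_below are hypothetical names, and a plain counter stands in for the extent buffer lock. It shows the idea of releasing every level below the new starting point before reading back down, so that no lower-level lock is held while upper levels are locked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_LEVEL 8

/* Hypothetical node: lock_count models how many times it is locked. */
struct toy_node {
	int lock_count;
};

/* Hypothetical search path: one node and one lock flag per tree level. */
struct toy_path {
	struct toy_node *nodes[TOY_MAX_LEVEL];
	bool locks[TOY_MAX_LEVEL];
};

/* Lock a node and record it in the path at the given level. */
static void toy_path_lock(struct toy_path *p, int level, struct toy_node *n)
{
	n->lock_count++;
	p->nodes[level] = n;
	p->locks[level] = true;
}

/*
 * Unlock and forget every node below 'level'. Level 0 is the leaf, so
 * this drops the leaf and all intermediate nodes under the new start.
 */
static void toy_drop_below(struct toy_path *p, int level)
{
	for (int i = 0; i < level; i++) {
		if (p->locks[i]) {
			p->nodes[i]->lock_count--;
			p->locks[i] = false;
		}
		p->nodes[i] = NULL;
	}
}
```

After toy_drop_below(&path, 2), only the node at level 2 and above remains locked, so the subsequent descent takes locks strictly from the top down.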
Reported-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-07 00:27:30 +03:00
	int i;
2008-06-26 00:01:30 +04:00
btrfs: fix assertion failure and blocking during nowait buffered write
When doing a nowait buffered write we can trigger the following assertion:
[11138.437027] assertion failed: !path->nowait, in fs/btrfs/ctree.c:4658
[11138.438251] ------------[ cut here ]------------
[11138.438254] kernel BUG at fs/btrfs/messages.c:259!
[11138.438762] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[11138.439450] CPU: 4 PID: 1091021 Comm: fsstress Not tainted 6.1.0-rc4-btrfs-next-128 #1
[11138.440611] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[11138.442553] RIP: 0010:btrfs_assertfail+0x19/0x1b [btrfs]
[11138.443583] Code: 5b 41 5a 41 (...)
[11138.446437] RSP: 0018:ffffbaf0cf05b840 EFLAGS: 00010246
[11138.447235] RAX: 0000000000000039 RBX: ffffbaf0cf05b938 RCX: 0000000000000000
[11138.448303] RDX: 0000000000000000 RSI: ffffffffb2ef59f6 RDI: 00000000ffffffff
[11138.449370] RBP: ffff9165f581eb68 R08: 00000000ffffffff R09: 0000000000000001
[11138.450493] R10: ffff9167a88421f8 R11: 0000000000000000 R12: ffff9164981b1000
[11138.451661] R13: 000000008c8f1000 R14: ffff9164991d4000 R15: ffff9164981b1000
[11138.452225] FS: 00007f1438a66440(0000) GS:ffff9167ad600000(0000) knlGS:0000000000000000
[11138.452949] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11138.453394] CR2: 00007f1438a64000 CR3: 0000000100c36002 CR4: 0000000000370ee0
[11138.454057] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[11138.454879] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[11138.455779] Call Trace:
[11138.456211] <TASK>
[11138.456598] btrfs_next_old_leaf.cold+0x18/0x1d [btrfs]
[11138.457827] ? kmem_cache_alloc+0x18d/0x2a0
[11138.458516] btrfs_lookup_csums_range+0x149/0x4d0 [btrfs]
[11138.459407] csum_exist_in_range+0x56/0x110 [btrfs]
[11138.460271] can_nocow_file_extent+0x27c/0x310 [btrfs]
[11138.461155] can_nocow_extent+0x1ec/0x2e0 [btrfs]
[11138.461672] btrfs_check_nocow_lock+0x114/0x1c0 [btrfs]
[11138.462951] btrfs_buffered_write+0x44c/0x8e0 [btrfs]
[11138.463482] btrfs_do_write_iter+0x42b/0x5f0 [btrfs]
[11138.463982] ? lock_release+0x153/0x4a0
[11138.464347] io_write+0x11b/0x570
[11138.464660] ? lock_release+0x153/0x4a0
[11138.465213] ? lock_is_held_type+0xe8/0x140
[11138.466003] io_issue_sqe+0x63/0x4a0
[11138.466339] io_submit_sqes+0x238/0x770
[11138.466741] __do_sys_io_uring_enter+0x37b/0xb10
[11138.467206] ? lock_is_held_type+0xe8/0x140
[11138.467879] ? syscall_enter_from_user_mode+0x1d/0x50
[11138.468688] do_syscall_64+0x38/0x90
[11138.469265] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[11138.470017] RIP: 0033:0x7f1438c539e6
This is because, to decide whether we can NOCOW, we first check whether we
can NOCOW into the extent (it's a prealloc extent or the inode has the NOCOW
attribute), and then check if there are csums for the extent's range in the
csum tree.
The search may leave us beyond the last slot of a leaf, and then when
we call btrfs_next_leaf() we end up at btrfs_next_old_leaf() with a
time_seq of 0.
This triggers a failure of the first assertion at btrfs_next_old_leaf(),
since we have a nowait path. With assertions disabled, we simply don't
respect the NOWAIT semantics, allowing the write to block on locks or
on IO for reading an extent buffer from disk.
Fix this by:
1) Triggering the assertion only if time_seq is not 0, which means that the
search is being done by a tree mod log user, and in the buffered and
direct IO write paths we don't use the tree mod log;
2) Implementing NOWAIT semantics at btrfs_next_old_leaf(). Any failure to
lock an extent buffer should return immediately and not retry the
search, as well as if we need to do IO to read an extent buffer from
disk.
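The NOWAIT behaviour in point 2 can be sketched in userspace as below. This is an illustration under stated assumptions rather than kernel code: toy_rwsem is a hypothetical spin-based reader/writer lock built on C11 atomics that stands in for the commit root semaphore, and lock_commit_root is a hypothetical helper. The pattern it shows is that a nowait caller only tries the lock and reports -EAGAIN on failure, while other callers are allowed to wait.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy reader/writer lock: -1 means write-locked, n >= 0 means n readers. */
struct toy_rwsem {
	atomic_int state;
};

/* Try to take a read lock; never waits. Returns true on success. */
static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
	int cur = atomic_load(&sem->state);

	while (cur >= 0) {
		/* On failure, 'cur' is refreshed with the current state. */
		if (atomic_compare_exchange_weak(&sem->state, &cur, cur + 1))
			return true;
	}
	return false;
}

/* Take a read lock, spinning until no writer holds the lock. */
static void toy_down_read(struct toy_rwsem *sem)
{
	while (!toy_down_read_trylock(sem))
		;
}

static void toy_up_read(struct toy_rwsem *sem)
{
	atomic_fetch_sub(&sem->state, 1);
}

/* Try to take the write lock; succeeds only if the lock is idle. */
static bool toy_down_write_trylock(struct toy_rwsem *sem)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&sem->state, &expected, -1);
}

static void toy_up_write(struct toy_rwsem *sem)
{
	atomic_store(&sem->state, 0);
}

/*
 * Hypothetical helper mirroring the nowait rule: fail fast with -EAGAIN
 * instead of waiting when the caller asked for nowait semantics.
 */
static int lock_commit_root(struct toy_rwsem *sem, bool nowait)
{
	if (nowait) {
		if (!toy_down_read_trylock(sem))
			return -EAGAIN;
	} else {
		toy_down_read(sem);
	}
	return 0;
}
```

A caller that receives -EAGAIN is expected to back off and retry from a context that is allowed to block, rather than retrying the search in place.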
Fixes: c922b016f353 ("btrfs: assert nowait mode is not used for some btree search functions")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-11 03:54:40 +03:00
	/*
	 * The nowait semantics are used only for write paths, where we don't
	 * use the tree mod log and sequence numbers.
	 */
	if (time_seq)
		ASSERT(!path->nowait);
2022-09-12 22:27:51 +03:00
2008-06-26 00:01:30 +04:00
	nritems = btrfs_header_nritems(path->nodes[0]);
2009-01-06 05:25:51 +03:00
	if (nritems == 0)
2008-06-26 00:01:30 +04:00
		return 1;
2009-04-03 18:14:18 +04:00
	btrfs_item_key_to_cpu(path->nodes[0], &key, nritems - 1);
again:
	level = 1;
	next = NULL;
2011-04-21 03:20:15 +04:00
	btrfs_release_path(path);
2009-04-03 18:14:18 +04:00
2008-06-26 00:01:30 +04:00
	path->keep_locks = 1;
2009-04-03 18:14:18 +04:00
2021-11-22 15:03:38 +03:00
	if (time_seq) {
2012-06-11 10:29:29 +04:00
		ret = btrfs_search_old_slot(root, &key, path, time_seq);
2021-11-22 15:03:38 +03:00
	} else {
		if (path->need_commit_sem) {
			path->need_commit_sem = 0;
			need_commit_sem = true;
2022-11-11 03:54:40 +03:00
			if (path->nowait) {
				if (!down_read_trylock(&fs_info->commit_root_sem)) {
					ret = -EAGAIN;
					goto done;
				}
			} else {
				down_read(&fs_info->commit_root_sem);
			}
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (especially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long-running operations: relocation
because it has to do a lot of IO and expensive backreference lookups when
there are many snapshots, and send because of read IO when operating on
very large trees. This makes it inconvenient for users and tools to
schedule both operations.
For zoned filesystems we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it, or send can end up
delaying the block group relocation for too long. In the future we might
also get automatic block group relocation for non-zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove or update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
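Step 4 above can be illustrated with a small userspace sketch. This is not the kernel implementation: toy_tree, toy_cursor, toy_search() and toy_next() are hypothetical names, a sorted array stands in for a read-only tree, and a plain counter stands in for "a transaction used for relocation was committed". It only shows why restarting the search from the last processed key is safe when keys cannot change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a read-only tree: a sorted array of keys. */
struct toy_tree {
	const uint64_t *keys;       /* may point to a new copy after "relocation" */
	size_t nr;
	uint64_t reloc_generation;  /* bumped when a relocation commit happens */
};

/* Hypothetical cursor; a real search path holds extent buffers instead. */
struct toy_cursor {
	size_t slot;
	uint64_t seen_generation;
	unsigned int restarts;
};

/* Lower bound: first slot whose key is >= key. */
static size_t toy_search(const struct toy_tree *t, uint64_t key)
{
	size_t lo = 0, hi = t->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (t->keys[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Move to the key after last_key. If a relocation commit was observed
 * since the previous step, the cached slot is treated as stale and the
 * position is found again by searching for last_key. This works because
 * the set of keys in a read-only tree does not change.
 * Returns 0 and stores the next key in *out, or 1 when there is none.
 */
static int toy_next(const struct toy_tree *t, struct toy_cursor *c,
		    uint64_t last_key, uint64_t *out)
{
	if (c->seen_generation != t->reloc_generation) {
		c->slot = toy_search(t, last_key);
		c->seen_generation = t->reloc_generation;
		c->restarts++;
	}
	if (c->slot + 1 >= t->nr)
		return 1;
	c->slot++;
	*out = t->keys[c->slot];
	return 0;
}
```

The cursor never trusts a cached position across a generation change, which is the toy equivalent of releasing the search path(s) and searching again.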
This leaves some room for optimizations so that send does fewer path
releases and tree re-searches when there's relocation running, but for
now it's kept simple as it performs quite well (on very large trees with
resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 15:03:38 +03:00
}
2012-06-11 10:29:29 +04:00
ret = btrfs_search_slot ( NULL , root , & key , path , 0 , 0 ) ;
2021-11-22 15:03:38 +03:00
}
2008-06-26 00:01:30 +04:00
path - > keep_locks = 0 ;
if ( ret < 0 )
2021-11-22 15:03:38 +03:00
goto done ;
2008-06-26 00:01:30 +04:00
2008-06-26 00:01:30 +04:00
nritems = btrfs_header_nritems ( path - > nodes [ 0 ] ) ;
2008-06-26 00:01:30 +04:00
/*
* by releasing the path above we dropped all our locks . A balance
* could have added more items next to the key that used to be
* at the very end of the block . So , check again here and
* advance the path if there are now more items available .
*/
2008-06-26 00:01:30 +04:00
if ( nritems > 0 & & path - > slots [ 0 ] < nritems - 1 ) {
2009-07-22 17:59:00 +04:00
if ( ret = = 0 )
path - > slots [ 0 ] + + ;
2009-04-03 18:14:18 +04:00
ret = 0 ;
2008-06-26 00:01:30 +04:00
goto done ;
}
Btrfs: fix leaf corruption after __btrfs_drop_extents
Several reports of leaf corruption have been floating around the list. One of
them points to __btrfs_drop_extents(), and we found that the leaf becomes
corrupted after __btrfs_drop_extents(). It is a rare case, but it does exist.
The problem turns out to be the btrfs_next_leaf() call in __btrfs_drop_extents().
In btrfs_next_leaf(), we release the current path and re-search the last key of
the leaf in order to locate the next leaf. We already take into account that
balance operations between leaves may happen during this 'unlock and re-lock'
dance, so we check the path again and advance it if more items are now available.
But things are a bit different if that last key happens to be removed and balance
brings in a bigger key as the last one: btrfs_search_slot() then returns it with
ret > 0. IOW, nothing changed in this leaf except the new last key, so we think
we're okay because no additional items were balanced in, and we conclude that we
can go to the next leaf.
However, we should return that bigger key, otherwise we risk leaf corruption.
For example, in endio, skipping that key means that __btrfs_drop_extents() thinks
it has dropped all extents matching the required range and that
finish_ordered_io can safely insert a new extent, but it actually hasn't, and we
end up with a leaf corruption.
One may ask why our locking on the extent io tree doesn't work as expected,
i.e. why it doesn't prevent this kind of race. But in __btrfs_drop_extents(),
we don't always find extents which are included within our locking range. IOW,
extents can start before our search start, and in this case locking on the
extent io tree doesn't protect us from the race.
This takes the special case into account.
Reviewed-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 07:04:49 +04:00
/*
* So the above check misses one case :
* - after releasing the path above , someone has removed the item that
* used to be at the very end of the block , and balance between leafs
* gets another one with bigger key . offset to replace it .
*
* This one should be returned as well , or we can get leaf corruption
* later ( esp . in __btrfs_drop_extents ( ) ) .
*
* And a bit more explanation about this check ,
* with ret > 0 , the key isn ' t found , the path points to the slot
* where it should be inserted , so the path - > slots [ 0 ] item must be the
* bigger one .
*/
if ( nritems > 0 & & ret > 0 & & path - > slots [ 0 ] = = nritems - 1 ) {
ret = 0 ;
goto done ;
}
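The two checks above can be summarized with a small userspace sketch. It is only an illustration under stated assumptions: a sorted array stands in for a leaf, and toy_search_slot() and toy_recheck_leaf() are hypothetical helpers that mimic the "0 if found, 1 if not found, slot is the insertion point" convention described in the comment. It is not the btrfs code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Lower-bound search in a sorted array. Stores the slot where key is, or
 * where it would be inserted. Returns 0 if the key was found, 1 if not.
 */
static int toy_search_slot(const uint64_t *keys, size_t nritems,
			   uint64_t key, size_t *slot)
{
	size_t lo = 0, hi = nritems;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (keys[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	*slot = lo;
	return (lo < nritems && keys[lo] == key) ? 0 : 1;
}

/*
 * Re-search for the key that used to be last in the leaf and decide what
 * to do. Returns 0 if *slot now points at the next item inside this leaf,
 * or 1 if the caller has to walk to the next leaf.
 */
static int toy_recheck_leaf(const uint64_t *keys, size_t nritems,
			    uint64_t last_key, size_t *slot)
{
	int ret = toy_search_slot(keys, nritems, last_key, slot);

	/* Case 1: items were added after the position of the old last key. */
	if (nritems > 0 && *slot < nritems - 1) {
		if (ret == 0)
			(*slot)++;
		return 0;
	}
	/*
	 * Case 2: the old last key is gone and a bigger key now occupies
	 * the last slot. It must be returned, not skipped.
	 */
	if (nritems > 0 && ret > 0 && *slot == nritems - 1)
		return 0;
	/* Case 3: nothing new in this leaf. */
	return 1;
}
```

With keys {10, 20, 35} and an old last key of 30, the search does not find 30 and lands on the last slot, so the second case applies and 35 is returned instead of being skipped.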
2007-02-21 00:40:44 +03:00
2009-01-06 05:25:51 +03:00
while ( level < BTRFS_MAX_LEVEL ) {
2009-04-03 18:14:18 +04:00
if ( ! path - > nodes [ level ] ) {
ret = 1 ;
goto done ;
}
2007-10-16 00:14:19 +04:00
2007-02-21 00:40:44 +03:00
slot = path - > slots [ level ] + 1 ;
c = path - > nodes [ level ] ;
2007-10-16 00:14:19 +04:00
if ( slot > = btrfs_header_nritems ( c ) ) {
2007-02-21 00:40:44 +03:00
level + + ;
2009-04-03 18:14:18 +04:00
if ( level = = BTRFS_MAX_LEVEL ) {
ret = 1 ;
goto done ;
}
2007-02-21 00:40:44 +03:00
continue ;
}
2007-10-16 00:14:19 +04:00
btrfs: unlock to current level in btrfs_next_old_leaf
Filipe reported the following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
5.10.0-rc2-btrfs-next-71 #1 Not tainted
------------------------------------------------------
find/324157 is trying to acquire lock:
ffff8ebc48d293a0 (btrfs-tree-01#2/3){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
but task is already holding lock:
ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (btrfs-tree-00){++++}-{3:3}:
lock_acquire+0xd8/0x490
down_write_nested+0x44/0x120
__btrfs_tree_lock+0x27/0x120 [btrfs]
btrfs_search_slot+0x2a3/0xc50 [btrfs]
btrfs_insert_empty_items+0x58/0xa0 [btrfs]
insert_with_overflow+0x44/0x110 [btrfs]
btrfs_insert_xattr_item+0xb8/0x1d0 [btrfs]
btrfs_setxattr+0xd6/0x4c0 [btrfs]
btrfs_setxattr_trans+0x68/0x100 [btrfs]
__vfs_setxattr+0x66/0x80
__vfs_setxattr_noperm+0x70/0x200
vfs_setxattr+0x6b/0x120
setxattr+0x125/0x240
path_setxattr+0xba/0xd0
__x64_sys_setxattr+0x27/0x30
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (btrfs-tree-01#2/3){++++}-{3:3}:
check_prev_add+0x91/0xc60
__lock_acquire+0x1689/0x3130
lock_acquire+0xd8/0x490
down_read_nested+0x45/0x220
__btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
btrfs_next_old_leaf+0x27d/0x580 [btrfs]
btrfs_real_readdir+0x1e3/0x4b0 [btrfs]
iterate_dir+0x170/0x1c0
__x64_sys_getdents64+0x83/0x140
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-tree-00);
lock(btrfs-tree-01#2/3);
lock(btrfs-tree-00);
lock(btrfs-tree-01#2/3);
*** DEADLOCK ***
5 locks held by find/324157:
#0: ffff8ebc502c6e00 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
#1: ffff8eb97f689980 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: iterate_dir+0x52/0x1c0
#2: ffff8ebaec00ca58 (btrfs-tree-02#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
#3: ffff8eb98f986f78 (btrfs-tree-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
#4: ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
stack backtrace:
CPU: 2 PID: 324157 Comm: find Not tainted 5.10.0-rc2-btrfs-next-71 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
check_noncircular+0xff/0x110
? mark_lock.part.0+0x468/0xe90
check_prev_add+0x91/0xc60
__lock_acquire+0x1689/0x3130
? kvm_clock_read+0x14/0x30
? kvm_sched_clock_read+0x5/0x10
lock_acquire+0xd8/0x490
? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
down_read_nested+0x45/0x220
? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
__btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
btrfs_next_old_leaf+0x27d/0x580 [btrfs]
btrfs_real_readdir+0x1e3/0x4b0 [btrfs]
iterate_dir+0x170/0x1c0
__x64_sys_getdents64+0x83/0x140
? filldir+0x1d0/0x1d0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because btrfs_next_old_leaf searches down to our current
key, and then walks up the path until we can move to the next slot, and
then reads back down the path so we get the next leaf.
However it doesn't unlock any lower levels until it replaces them with
the new extent buffer. This is technically fine, but of course causes
lockdep to complain, because we could be holding locks on lower levels
while locking upper levels.
Fix this by dropping all nodes below the level that we use as our new
starting point before we start reading back down the path. This also
allows us to drop the nested/recursive locking magic, because we're no
longer locking two nodes at the same level anymore.
Reported-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-07 00:27:30 +03:00
/*
* Our current level is where we ' re going to start from , and to
* make sure lockdep doesn ' t complain we need to drop our locks
* and nodes from 0 to our current level .
*/
for ( i = 0 ; i < level ; i + + ) {
if ( path - > locks [ i ] ) {
btrfs_tree_read_unlock ( path - > nodes [ i ] ) ;
path - > locks [ i ] = 0 ;
}
free_extent_buffer ( path - > nodes [ i ] ) ;
path - > nodes [ i ] = NULL ;
2008-06-26 00:01:30 +04:00
}
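The loop above can be pictured with a minimal userspace sketch. The struct and function names (toy_path, toy_drop_below()) are hypothetical, integers stand in for extent buffers and lock state, and the sketch checks each level's own lock flag before clearing it. It only illustrates the idea of dropping everything below the level where the downward walk restarts.

```c
#include <assert.h>

#define TOY_MAX_LEVEL 8

/* Hypothetical, simplified path: one node id and one lock flag per level. */
struct toy_path {
	int nodes[TOY_MAX_LEVEL];  /* nonzero id means a node is referenced */
	int locks[TOY_MAX_LEVEL];  /* nonzero means that level is read locked */
};

/*
 * Release every level strictly below 'level': unlock it if it is locked
 * and forget the node. Levels at or above 'level' are left untouched.
 * Returns the number of locks that were released.
 */
static int toy_drop_below(struct toy_path *p, int level)
{
	int released = 0;

	for (int i = 0; i < level && i < TOY_MAX_LEVEL; i++) {
		if (p->locks[i]) {
			p->locks[i] = 0;
			released++;
		}
		p->nodes[i] = 0;
	}
	return released;
}
```

After the call, no lower-level lock is held while the next node is locked on the way back down, which is the ordering lockdep expects.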
2007-10-16 00:14:19 +04:00
2009-04-03 18:14:18 +04:00
next = c ;
2017-01-30 23:23:42 +03:00
ret = read_block_for_search ( root , path , & next , level ,
2017-02-10 20:44:32 +03:00
slot , & key ) ;
btrfs: fix assertion failure and blocking during nowait buffered write
When doing a nowait buffered write we can trigger the following assertion:
[11138.437027] assertion failed: !path->nowait, in fs/btrfs/ctree.c:4658
[11138.438251] ------------[ cut here ]------------
[11138.438254] kernel BUG at fs/btrfs/messages.c:259!
[11138.438762] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[11138.439450] CPU: 4 PID: 1091021 Comm: fsstress Not tainted 6.1.0-rc4-btrfs-next-128 #1
[11138.440611] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[11138.442553] RIP: 0010:btrfs_assertfail+0x19/0x1b [btrfs]
[11138.443583] Code: 5b 41 5a 41 (...)
[11138.446437] RSP: 0018:ffffbaf0cf05b840 EFLAGS: 00010246
[11138.447235] RAX: 0000000000000039 RBX: ffffbaf0cf05b938 RCX: 0000000000000000
[11138.448303] RDX: 0000000000000000 RSI: ffffffffb2ef59f6 RDI: 00000000ffffffff
[11138.449370] RBP: ffff9165f581eb68 R08: 00000000ffffffff R09: 0000000000000001
[11138.450493] R10: ffff9167a88421f8 R11: 0000000000000000 R12: ffff9164981b1000
[11138.451661] R13: 000000008c8f1000 R14: ffff9164991d4000 R15: ffff9164981b1000
[11138.452225] FS: 00007f1438a66440(0000) GS:ffff9167ad600000(0000) knlGS:0000000000000000
[11138.452949] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11138.453394] CR2: 00007f1438a64000 CR3: 0000000100c36002 CR4: 0000000000370ee0
[11138.454057] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[11138.454879] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[11138.455779] Call Trace:
[11138.456211] <TASK>
[11138.456598] btrfs_next_old_leaf.cold+0x18/0x1d [btrfs]
[11138.457827] ? kmem_cache_alloc+0x18d/0x2a0
[11138.458516] btrfs_lookup_csums_range+0x149/0x4d0 [btrfs]
[11138.459407] csum_exist_in_range+0x56/0x110 [btrfs]
[11138.460271] can_nocow_file_extent+0x27c/0x310 [btrfs]
[11138.461155] can_nocow_extent+0x1ec/0x2e0 [btrfs]
[11138.461672] btrfs_check_nocow_lock+0x114/0x1c0 [btrfs]
[11138.462951] btrfs_buffered_write+0x44c/0x8e0 [btrfs]
[11138.463482] btrfs_do_write_iter+0x42b/0x5f0 [btrfs]
[11138.463982] ? lock_release+0x153/0x4a0
[11138.464347] io_write+0x11b/0x570
[11138.464660] ? lock_release+0x153/0x4a0
[11138.465213] ? lock_is_held_type+0xe8/0x140
[11138.466003] io_issue_sqe+0x63/0x4a0
[11138.466339] io_submit_sqes+0x238/0x770
[11138.466741] __do_sys_io_uring_enter+0x37b/0xb10
[11138.467206] ? lock_is_held_type+0xe8/0x140
[11138.467879] ? syscall_enter_from_user_mode+0x1d/0x50
[11138.468688] do_syscall_64+0x38/0x90
[11138.469265] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[11138.470017] RIP: 0033:0x7f1438c539e6
This is because, to decide whether we can NOCOW, we first check if we can
NOCOW into an extent (it's a prealloc extent or the inode has the NOCOW
attribute), and then check if there are csums for the extent's range in
the csum tree.
The search may leave us beyond the last slot of a leaf, and then when
we call btrfs_next_leaf() we end up at btrfs_next_old_leaf() with a
time_seq of 0.
This triggers a failure of the first assertion at btrfs_next_old_leaf(),
since we have a nowait path. With assertions disabled, we simply don't
respect the NOWAIT semantics, allowing the write to block on locks or on
IO for reading an extent buffer from disk.
Fix this by:
1) Triggering the assertion only if time_seq is not 0, which means that
search is being done by a tree mod log user, and in the buffered and
direct IO write paths we don't use the tree mod log;
2) Implementing NOWAIT semantics at btrfs_next_old_leaf(). Any failure to
lock an extent buffer should return immediately and not retry the
search, as well as if we need to do IO to read an extent buffer from
disk.
Fixes: c922b016f353 ("btrfs: assert nowait mode is not used for some btree search functions")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-11 03:54:40 +03:00
if ( ret = = - EAGAIN & & ! path - > nowait )
2009-04-03 18:14:18 +04:00
goto again ;
2007-10-16 00:14:19 +04:00
2009-05-14 21:24:30 +04:00
if ( ret < 0 ) {
2011-04-21 03:20:15 +04:00
btrfs_release_path ( path ) ;
2009-05-14 21:24:30 +04:00
goto done ;
}
2008-06-26 00:01:30 +04:00
if ( ! path - > skip_locking ) {
2011-07-16 23:23:14 +04:00
ret = btrfs_try_tree_read_lock ( next ) ;
2022-11-11 03:54:40 +03:00
if ( ! ret & & path - > nowait ) {
ret = - EAGAIN ;
goto done ;
}
2012-06-22 16:51:15 +04:00
if ( ! ret & & time_seq ) {
/*
* If we don ' t get the lock , we may be racing
* with push_leaf_left , holding that lock while
* itself waiting for the leaf we ' ve currently
* locked . To solve this situation , we give up
* on our lock and cycle .
*/
2012-07-04 17:42:48 +04:00
free_extent_buffer ( next ) ;
2012-06-22 16:51:15 +04:00
btrfs_release_path ( path ) ;
cond_resched ( ) ;
goto again ;
}
btrfs: unlock to current level in btrfs_next_old_leaf
Filipe reported the following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
5.10.0-rc2-btrfs-next-71 #1 Not tainted
------------------------------------------------------
find/324157 is trying to acquire lock:
ffff8ebc48d293a0 (btrfs-tree-01#2/3){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
but task is already holding lock:
ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (btrfs-tree-00){++++}-{3:3}:
lock_acquire+0xd8/0x490
down_write_nested+0x44/0x120
__btrfs_tree_lock+0x27/0x120 [btrfs]
btrfs_search_slot+0x2a3/0xc50 [btrfs]
btrfs_insert_empty_items+0x58/0xa0 [btrfs]
insert_with_overflow+0x44/0x110 [btrfs]
btrfs_insert_xattr_item+0xb8/0x1d0 [btrfs]
btrfs_setxattr+0xd6/0x4c0 [btrfs]
btrfs_setxattr_trans+0x68/0x100 [btrfs]
__vfs_setxattr+0x66/0x80
__vfs_setxattr_noperm+0x70/0x200
vfs_setxattr+0x6b/0x120
setxattr+0x125/0x240
path_setxattr+0xba/0xd0
__x64_sys_setxattr+0x27/0x30
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (btrfs-tree-01#2/3){++++}-{3:3}:
check_prev_add+0x91/0xc60
__lock_acquire+0x1689/0x3130
lock_acquire+0xd8/0x490
down_read_nested+0x45/0x220
__btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
btrfs_next_old_leaf+0x27d/0x580 [btrfs]
btrfs_real_readdir+0x1e3/0x4b0 [btrfs]
iterate_dir+0x170/0x1c0
__x64_sys_getdents64+0x83/0x140
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-tree-00);
lock(btrfs-tree-01#2/3);
lock(btrfs-tree-00);
lock(btrfs-tree-01#2/3);
*** DEADLOCK ***
5 locks held by find/324157:
#0: ffff8ebc502c6e00 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
#1: ffff8eb97f689980 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: iterate_dir+0x52/0x1c0
#2: ffff8ebaec00ca58 (btrfs-tree-02#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
#3: ffff8eb98f986f78 (btrfs-tree-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
#4: ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
stack backtrace:
CPU: 2 PID: 324157 Comm: find Not tainted 5.10.0-rc2-btrfs-next-71 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
check_noncircular+0xff/0x110
? mark_lock.part.0+0x468/0xe90
check_prev_add+0x91/0xc60
__lock_acquire+0x1689/0x3130
? kvm_clock_read+0x14/0x30
? kvm_sched_clock_read+0x5/0x10
lock_acquire+0xd8/0x490
? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
down_read_nested+0x45/0x220
? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
__btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
btrfs_next_old_leaf+0x27d/0x580 [btrfs]
btrfs_real_readdir+0x1e3/0x4b0 [btrfs]
iterate_dir+0x170/0x1c0
__x64_sys_getdents64+0x83/0x140
? filldir+0x1d0/0x1d0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because btrfs_next_old_leaf searches down to our current
key, and then walks up the path until we can move to the next slot, and
then reads back down the path so we get the next leaf.
However it doesn't unlock any lower levels until it replaces them with
the new extent buffer. This is technically fine, but of course causes
lockdep to complain, because we could be holding locks on lower levels
while locking upper levels.
Fix this by dropping all nodes below the level that we use as our new
starting point before we start reading back down the path. This also
allows us to drop the nested/recursive locking magic, because we're no
longer locking two nodes at the same level anymore.
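
A minimal userspace sketch of the idea described above, not the kernel code: before walking back down from a chosen level, every node below that level is unlocked and dropped from the path. The names toy_node, toy_path and toy_release_below are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_LEVEL 8

/* Hypothetical stand-in for an extent buffer: it only tracks lock state. */
struct toy_node {
	bool locked;
};

/* Hypothetical stand-in for a search path: one node and lock flag per level. */
struct toy_path {
	struct toy_node *nodes[TOY_MAX_LEVEL];
	bool locks[TOY_MAX_LEVEL];
};

/*
 * Unlock and forget every node strictly below 'level', so that no lower
 * level lock is held while new nodes are locked on the way back down.
 * Returns the number of levels that were released.
 */
int toy_release_below(struct toy_path *path, int level)
{
	int released = 0;

	for (int i = 0; i < level && i < TOY_MAX_LEVEL; i++) {
		if (!path->nodes[i])
			continue;
		if (path->locks[i]) {
			path->nodes[i]->locked = false;
			path->locks[i] = false;
		}
		path->nodes[i] = NULL;
		released++;
	}
	return released;
}
```

With a three level path and level 2 as the new starting point, the call releases levels 0 and 1 and leaves the level 2 node locked.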
Reported-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-07 00:27:30 +03:00
if ( ! ret )
btrfs_tree_read_lock ( next ) ;
2008-06-26 00:01:30 +04:00
}
2007-02-21 00:40:44 +03:00
break ;
}
path - > slots [ level ] = slot ;
2009-01-06 05:25:51 +03:00
while ( 1 ) {
2007-02-21 00:40:44 +03:00
level - - ;
path - > nodes [ level ] = next ;
path - > slots [ level ] = 0 ;
2008-06-26 00:01:31 +04:00
if ( ! path - > skip_locking )
2020-11-07 00:27:29 +03:00
path - > locks [ level ] = BTRFS_READ_LOCK ;
2007-02-21 00:40:44 +03:00
if ( ! level )
break ;
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop. Most of
the time it is able to avoid going for the full mutex, so the trylock
loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able to get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks;
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
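
The state transitions described above can be sketched as a single-threaded model. This is an illustration only: the toy_* names are hypothetical, there is no real spinlock or wait queue, and a lock attempt that would have to wait in the kernel simply returns false here.

```c
#include <stdbool.h>

/* Hypothetical model of the per-buffer lock state. */
struct toy_eb_lock {
	bool spin_held;	/* the spin lock is currently held */
	bool blocking;	/* models the EXTENT_BUFFER_BLOCKING bit */
};

/* The buffer counts as locked in either the spinning or the blocking state. */
bool toy_is_locked(const struct toy_eb_lock *l)
{
	return l->spin_held || l->blocking;
}

/*
 * Take the lock in spinning mode. Returns false when the buffer is already
 * locked; the real code would spin or sleep on a wait queue instead.
 */
bool toy_tree_lock(struct toy_eb_lock *l)
{
	if (toy_is_locked(l))
		return false;
	l->spin_held = true;
	return true;
}

/* Set the blocking bit, then drop the spin lock. Still considered locked. */
void toy_set_lock_blocking(struct toy_eb_lock *l)
{
	if (!l->spin_held)
		return;
	l->blocking = true;
	l->spin_held = false;
}

/* Clear the blocking bit and return with the spin lock held again. */
void toy_clear_lock_blocking(struct toy_eb_lock *l)
{
	if (!l->blocking)
		return;
	l->spin_held = true;
	l->blocking = false;
}

/* Unlock from either state, based on which one is active. */
void toy_tree_unlock(struct toy_eb_lock *l)
{
	l->blocking = false;
	l->spin_held = false;
}
```

The point of the model is that callers see one logical lock: switching between spinning and blocking never makes the buffer appear unlocked.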
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 17:25:08 +03:00
2017-01-30 23:23:42 +03:00
ret = read_block_for_search ( root , path , & next , level ,
2017-02-10 20:44:32 +03:00
0 , & key ) ;
btrfs: fix assertion failure and blocking during nowait buffered write
When doing a nowait buffered write we can trigger the following assertion:
[11138.437027] assertion failed: !path->nowait, in fs/btrfs/ctree.c:4658
[11138.438251] ------------[ cut here ]------------
[11138.438254] kernel BUG at fs/btrfs/messages.c:259!
[11138.438762] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[11138.439450] CPU: 4 PID: 1091021 Comm: fsstress Not tainted 6.1.0-rc4-btrfs-next-128 #1
[11138.440611] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[11138.442553] RIP: 0010:btrfs_assertfail+0x19/0x1b [btrfs]
[11138.443583] Code: 5b 41 5a 41 (...)
[11138.446437] RSP: 0018:ffffbaf0cf05b840 EFLAGS: 00010246
[11138.447235] RAX: 0000000000000039 RBX: ffffbaf0cf05b938 RCX: 0000000000000000
[11138.448303] RDX: 0000000000000000 RSI: ffffffffb2ef59f6 RDI: 00000000ffffffff
[11138.449370] RBP: ffff9165f581eb68 R08: 00000000ffffffff R09: 0000000000000001
[11138.450493] R10: ffff9167a88421f8 R11: 0000000000000000 R12: ffff9164981b1000
[11138.451661] R13: 000000008c8f1000 R14: ffff9164991d4000 R15: ffff9164981b1000
[11138.452225] FS: 00007f1438a66440(0000) GS:ffff9167ad600000(0000) knlGS:0000000000000000
[11138.452949] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11138.453394] CR2: 00007f1438a64000 CR3: 0000000100c36002 CR4: 0000000000370ee0
[11138.454057] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[11138.454879] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[11138.455779] Call Trace:
[11138.456211] <TASK>
[11138.456598] btrfs_next_old_leaf.cold+0x18/0x1d [btrfs]
[11138.457827] ? kmem_cache_alloc+0x18d/0x2a0
[11138.458516] btrfs_lookup_csums_range+0x149/0x4d0 [btrfs]
[11138.459407] csum_exist_in_range+0x56/0x110 [btrfs]
[11138.460271] can_nocow_file_extent+0x27c/0x310 [btrfs]
[11138.461155] can_nocow_extent+0x1ec/0x2e0 [btrfs]
[11138.461672] btrfs_check_nocow_lock+0x114/0x1c0 [btrfs]
[11138.462951] btrfs_buffered_write+0x44c/0x8e0 [btrfs]
[11138.463482] btrfs_do_write_iter+0x42b/0x5f0 [btrfs]
[11138.463982] ? lock_release+0x153/0x4a0
[11138.464347] io_write+0x11b/0x570
[11138.464660] ? lock_release+0x153/0x4a0
[11138.465213] ? lock_is_held_type+0xe8/0x140
[11138.466003] io_issue_sqe+0x63/0x4a0
[11138.466339] io_submit_sqes+0x238/0x770
[11138.466741] __do_sys_io_uring_enter+0x37b/0xb10
[11138.467206] ? lock_is_held_type+0xe8/0x140
[11138.467879] ? syscall_enter_from_user_mode+0x1d/0x50
[11138.468688] do_syscall_64+0x38/0x90
[11138.469265] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[11138.470017] RIP: 0033:0x7f1438c539e6
This is because, to check whether we can NOCOW, we first check that we can
NOCOW into an extent (it's a prealloc extent or the inode has the NOCOW
attribute), and then check if there are csums for the extent's range in
the csum tree.
The search may leave us beyond the last slot of a leaf, and then when
we call btrfs_next_leaf() we end up at btrfs_next_old_leaf() with a
time_seq of 0.
This triggers a failure of the first assertion at btrfs_next_old_leaf(),
since we have a nowait path. With assertions disabled, we simply don't
respect the NOWAIT semantics, allowing the write to block on locks or
on IO for reading an extent buffer from disk.
Fix this by:
1) Triggering the assertion only if time_seq is not 0, which means that
search is being done by a tree mod log user, and in the buffered and
direct IO write paths we don't use the tree mod log;
2) Implementing NOWAIT semantics at btrfs_next_old_leaf(). If we fail to
lock an extent buffer, or if we need to do IO to read an extent buffer
from disk, return immediately instead of retrying the search.
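
A rough userspace sketch of the locking decision in point 2, under the assumption of a toy reader/writer lock. The names toy_rwlock, toy_lock_next and TOY_WOULD_SLEEP are hypothetical; the model cannot sleep, so the blocking case is reported with a distinct return value instead.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical return value: the real blocking lock call would sleep here. */
#define TOY_WOULD_SLEEP 1

/* Hypothetical reader/writer lock state. */
struct toy_rwlock {
	int readers;
	bool writer;
};

/* Try to take a read lock without waiting. */
bool toy_try_read_lock(struct toy_rwlock *l)
{
	if (l->writer)
		return false;
	l->readers++;
	return true;
}

/*
 * Lock the next node according to the path flags:
 *  - skip_locking: take no lock at all;
 *  - nowait: try once and give up with -EAGAIN on contention;
 *  - otherwise: the real code would block until the lock is granted.
 */
int toy_lock_next(struct toy_rwlock *l, bool nowait, bool skip_locking)
{
	if (skip_locking)
		return 0;
	if (toy_try_read_lock(l))
		return 0;
	if (nowait)
		return -EAGAIN;
	return TOY_WOULD_SLEEP;
}
```

The caller of a nowait search is expected to handle -EAGAIN itself, for example by retrying from a context that is allowed to block.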
Fixes: c922b016f353 ("btrfs: assert nowait mode is not used for some btree search functions")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-11 03:54:40 +03:00
if ( ret = = - EAGAIN & & ! path - > nowait )
2009-04-03 18:14:18 +04:00
goto again ;
2009-05-14 21:24:30 +04:00
if ( ret < 0 ) {
2011-04-21 03:20:15 +04:00
btrfs_release_path ( path ) ;
2009-05-14 21:24:30 +04:00
goto done ;
}
2022-11-11 03:54:40 +03:00
if ( ! path - > skip_locking ) {
if ( path - > nowait ) {
if ( ! btrfs_try_tree_read_lock ( next ) ) {
ret = - EAGAIN ;
goto done ;
}
} else {
btrfs_tree_read_lock ( next ) ;
}
}
2007-02-21 00:40:44 +03:00
}
2009-04-03 18:14:18 +04:00
ret = 0 ;
2008-06-26 00:01:30 +04:00
done :
2012-03-19 23:54:38 +04:00
unlock_up ( path , 0 , 1 , 0 , NULL ) ;
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (especially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystems we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non-zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved in the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove or update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
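
Step 4 above can be sketched as a small model, assuming the relocation state is tracked as a transaction id that only grows. The structures toy_fs and toy_send and the helpers below are hypothetical and only illustrate the "detect, then restart from the same key" pattern.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical filesystem state: id of the last committed relocation transaction. */
struct toy_fs {
	uint64_t last_reloc_trans;
};

/* Hypothetical send state. */
struct toy_send {
	uint64_t last_reloc_trans;	/* value seen at the last (re)search */
	uint64_t next_key;		/* key to process next */
	int restarts;			/* number of search restarts so far */
};

/* A relocation transaction was committed since the last search. */
bool toy_need_restart(const struct toy_fs *fs, const struct toy_send *s)
{
	return fs->last_reloc_trans > s->last_reloc_trans;
}

/*
 * Process one key. If relocation committed in the meantime, record a
 * restart first; the real code would release the search path(s) and search
 * again for the same key, which is safe because the trees are read-only.
 * Returns the key that was processed.
 */
uint64_t toy_send_step(const struct toy_fs *fs, struct toy_send *s)
{
	if (toy_need_restart(fs, s)) {
		s->last_reloc_trans = fs->last_reloc_trans;
		s->restarts++;
	}
	return s->next_key++;
}
```

A restart never skips or repeats a key in this model, because the position is kept in the send state rather than in the path being released.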
This leaves some room for optimizations so that send does fewer path
releases and less re-searching of the trees when there's relocation
running, but for now it's kept simple as it performs quite well (on very
large trees with resulting send streams in the order of a few hundred
gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 15:03:38 +03:00
if ( need_commit_sem ) {
int ret2 ;
path - > need_commit_sem = 1 ;
ret2 = finish_need_commit_sem_search ( path ) ;
up_read ( & fs_info - > commit_root_sem ) ;
if ( ret2 )
ret = ret2 ;
}
2009-04-03 18:14:18 +04:00
return ret ;
2007-02-21 00:40:44 +03:00
}
2008-03-24 22:01:56 +03:00
2022-09-14 18:06:40 +03:00
int btrfs_next_old_item ( struct btrfs_root * root , struct btrfs_path * path , u64 time_seq )
{
path - > slots [ 0 ] + + ;
if ( path - > slots [ 0 ] > = btrfs_header_nritems ( path - > nodes [ 0 ] ) )
return btrfs_next_old_leaf ( root , path , time_seq ) ;
return 0 ;
}
2008-06-26 00:01:31 +04:00
/*
* this uses btrfs_prev_leaf to walk backwards in the tree , and keeps
* searching until it gets past min_objectid or finds an item of ' type '
*
* returns 0 if something is found , 1 if nothing was found and < 0 on error
*/
2008-03-24 22:01:56 +03:00
int btrfs_previous_item ( struct btrfs_root * root ,
struct btrfs_path * path , u64 min_objectid ,
int type )
{
struct btrfs_key found_key ;
struct extent_buffer * leaf ;
2008-09-06 00:13:11 +04:00
u32 nritems ;
2008-03-24 22:01:56 +03:00
int ret ;
2009-01-06 05:25:51 +03:00
while ( 1 ) {
2008-03-24 22:01:56 +03:00
if ( path - > slots [ 0 ] = = 0 ) {
ret = btrfs_prev_leaf ( root , path ) ;
if ( ret ! = 0 )
return ret ;
} else {
path - > slots [ 0 ] - - ;
}
leaf = path - > nodes [ 0 ] ;
2008-09-06 00:13:11 +04:00
nritems = btrfs_header_nritems ( leaf ) ;
if ( nritems = = 0 )
return 1 ;
if ( path - > slots [ 0 ] = = nritems )
path - > slots [ 0 ] - - ;
2008-03-24 22:01:56 +03:00
btrfs_item_key_to_cpu ( leaf , & found_key , path - > slots [ 0 ] ) ;
2008-09-06 00:13:11 +04:00
if ( found_key . objectid < min_objectid )
break ;
2009-07-24 19:06:53 +04:00
if ( found_key . type = = type )
return 0 ;
2008-09-06 00:13:11 +04:00
if ( found_key . objectid = = min_objectid & &
found_key . type < type )
break ;
2008-03-24 22:01:56 +03:00
}
return 1 ;
}
2014-01-12 17:38:33 +04:00
/*
* search in extent tree to find a previous Metadata / Data extent item with
* min objectid .
*
* returns 0 if something is found , 1 if nothing was found and < 0 on error
*/
int btrfs_previous_extent_item ( struct btrfs_root * root ,
struct btrfs_path * path , u64 min_objectid )
{
struct btrfs_key found_key ;
struct extent_buffer * leaf ;
u32 nritems ;
int ret ;
while ( 1 ) {
if ( path - > slots [ 0 ] = = 0 ) {
ret = btrfs_prev_leaf ( root , path ) ;
if ( ret ! = 0 )
return ret ;
} else {
path - > slots [ 0 ] - - ;
}
leaf = path - > nodes [ 0 ] ;
nritems = btrfs_header_nritems ( leaf ) ;
if ( nritems = = 0 )
return 1 ;
if ( path - > slots [ 0 ] = = nritems )
path - > slots [ 0 ] - - ;
btrfs_item_key_to_cpu ( leaf , & found_key , path - > slots [ 0 ] ) ;
if ( found_key . objectid < min_objectid )
break ;
if ( found_key . type = = BTRFS_EXTENT_ITEM_KEY | |
found_key . type = = BTRFS_METADATA_ITEM_KEY )
return 0 ;
if ( found_key . objectid = = min_objectid & &
found_key . type < BTRFS_EXTENT_ITEM_KEY )
break ;
}
return 1 ;
}
2022-09-14 18:06:38 +03:00
int __init btrfs_ctree_init ( void )
{
btrfs_path_cachep = kmem_cache_create ( " btrfs_path " ,
sizeof ( struct btrfs_path ) , 0 ,
SLAB_MEM_SPREAD , NULL ) ;
if ( ! btrfs_path_cachep )
return - ENOMEM ;
return 0 ;
}
void __cold btrfs_ctree_exit ( void )
{
kmem_cache_destroy ( btrfs_path_cachep ) ;
}