2007-06-12 09:07:21 -04:00
/*
* Copyright ( C ) 2007 Oracle . All rights reserved .
*
* This program is free software ; you can redistribute it and / or
* modify it under the terms of the GNU General Public
* License v2 as published by the Free Software Foundation .
*
* This program is distributed in the hope that it will be useful ,
* but WITHOUT ANY WARRANTY ; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE . See the GNU
* General Public License for more details .
*
* You should have received a copy of the GNU General Public
* License along with this program ; if not , write to the
* Free Software Foundation , Inc . , 59 Temple Place - Suite 330 ,
* Boston , MA 02111 - 1307 , USA .
*/
2007-03-22 15:59:16 -04:00
# include <linux/fs.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there, i.e. if only gfp is used,
gfp.h; if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have a fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition, while for others adding it to the
implementation .h or embedding .c file was more appropriate. This step
added inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
# include <linux/slab.h>
2007-06-12 11:36:58 -04:00
# include <linux/sched.h>
2007-09-17 10:58:06 -04:00
# include <linux/writeback.h>
2007-10-15 16:14:19 -04:00
# include <linux/pagemap.h>
2008-11-07 18:22:45 -05:00
# include <linux/blkdev.h>
2007-03-22 15:59:16 -04:00
# include "ctree.h"
# include "disk-io.h"
# include "transaction.h"
2008-06-25 16:01:30 -04:00
# include "locking.h"
2008-09-05 16:13:11 -04:00
# include "tree-log.h"
Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) as the inode number when we create a file. Inode numbers
are therefore never reclaimed when we delete files, so we'll run out of inode
numbers as we keep creating and deleting files on 32-bit machines.
This fixes it, and it works similarly to how we cache free space in block
groups.
We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.
Because we are searching the commit root, we have to carefully handle the
cross-transaction case.
The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K of RAM
for extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted at runtime.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-20 10:06:11 +08:00
# include "inode-map.h"
2007-03-22 15:59:16 -04:00
2007-04-09 10:42:37 -04:00
# define BTRFS_ROOT_TRANS_TAG 0
2008-02-01 16:35:04 -05:00
static noinline void put_transaction ( struct btrfs_transaction * transaction )
2007-03-22 15:59:16 -04:00
{
2011-04-11 15:45:29 -04:00
WARN_ON ( atomic_read ( & transaction - > use_count ) = = 0 ) ;
if ( atomic_dec_and_test ( & transaction - > use_count ) ) {
2011-04-11 17:25:13 -04:00
BUG_ON ( ! list_empty ( & transaction - > list ) ) ;
2011-09-14 12:37:00 +02:00
WARN_ON ( transaction - > delayed_refs . root . rb_node ) ;
WARN_ON ( ! list_empty ( & transaction - > delayed_refs . seq_head ) ) ;
2007-04-02 10:50:19 -04:00
memset ( transaction , 0 , sizeof ( * transaction ) ) ;
kmem_cache_free ( btrfs_transaction_cachep , transaction ) ;
2007-03-25 11:35:08 -04:00
}
2007-03-22 15:59:16 -04:00
}
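The reference-drop pattern used by put_transaction ( ) can be sketched in userspace C11. This is only an illustration under stated assumptions, not the kernel code: struct txn, txn_new ( ), txn_get ( ) and txn_put ( ) are hypothetical names, malloc / free stand in for the slab cache, and assert ( ) stands in for WARN_ON.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct btrfs_transaction. */
struct txn {
	atomic_int use_count;
};

/* Allocate an object that starts with initial_refs references. */
struct txn *txn_new(int initial_refs)
{
	struct txn *t = malloc(sizeof(*t));

	if (t)
		atomic_init(&t->use_count, initial_refs);
	return t;
}

/* Take one more reference. */
void txn_get(struct txn *t)
{
	atomic_fetch_add(&t->use_count, 1);
}

/*
 * Drop one reference. Returns 1 if this call released the object,
 * 0 if other references remain.
 */
int txn_put(struct txn *t)
{
	/* Dropping a reference that was never taken is a bug. */
	assert(atomic_load(&t->use_count) > 0);

	/* atomic_fetch_sub returns the old value: 1 means we were the last. */
	if (atomic_fetch_sub(&t->use_count, 1) == 1) {
		free(t);
		return 1;
	}
	return 0;
}
```

Only the caller that observes the count going from 1 to 0 frees the object, so concurrent droppers cannot free it twice.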
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick off the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block group's caching waitqueue; the
caching kthread wakes the waiting threads up every time it finds 2 meg
worth of space, and then again when it's finished caching. This is how I tested
the speedup from this:
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change that's been put in place is we lock the super mirrors in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until it's
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
static noinline void switch_commit_root ( struct btrfs_root * root )
{
free_extent_buffer ( root - > commit_root ) ;
root - > commit_root = btrfs_root_node ( root ) ;
}
2008-09-29 15:18:18 -04:00
/*
* either allocate a new transaction or hop into the existing one
*/
2011-04-11 17:25:13 -04:00
static noinline int join_transaction ( struct btrfs_root * root , int nofail )
2007-03-22 15:59:16 -04:00
{
struct btrfs_transaction * cur_trans ;
2011-04-11 17:25:13 -04:00
spin_lock ( & root - > fs_info - > trans_lock ) ;
2011-11-06 03:26:19 -05:00
loop :
2011-04-11 17:25:13 -04:00
if ( root - > fs_info - > trans_no_join ) {
if ( ! nofail ) {
spin_unlock ( & root - > fs_info - > trans_lock ) ;
return - EBUSY ;
}
}
2007-03-22 15:59:16 -04:00
cur_trans = root - > fs_info - > running_transaction ;
2011-04-11 17:25:13 -04:00
if ( cur_trans ) {
atomic_inc ( & cur_trans - > use_count ) ;
2011-04-11 15:45:29 -04:00
atomic_inc ( & cur_trans - > num_writers ) ;
2007-08-10 16:22:09 -04:00
cur_trans - > num_joined + + ;
2011-04-11 17:25:13 -04:00
spin_unlock ( & root - > fs_info - > trans_lock ) ;
return 0 ;
2007-03-22 15:59:16 -04:00
}
2011-04-11 17:25:13 -04:00
spin_unlock ( & root - > fs_info - > trans_lock ) ;
cur_trans = kmem_cache_alloc ( btrfs_transaction_cachep , GFP_NOFS ) ;
if ( ! cur_trans )
return - ENOMEM ;
2011-11-06 03:26:19 -05:00
2011-04-11 17:25:13 -04:00
spin_lock ( & root - > fs_info - > trans_lock ) ;
if ( root - > fs_info - > running_transaction ) {
2011-11-06 03:26:19 -05:00
/*
* someone started a transaction after we unlocked . Make sure
* to redo the trans_no_join checks above
*/
2011-04-11 17:25:13 -04:00
kmem_cache_free ( btrfs_transaction_cachep , cur_trans ) ;
cur_trans = root - > fs_info - > running_transaction ;
2011-11-06 03:26:19 -05:00
goto loop ;
2007-03-22 15:59:16 -04:00
}
2011-11-06 03:26:19 -05:00
2011-04-11 17:25:13 -04:00
atomic_set ( & cur_trans - > num_writers , 1 ) ;
cur_trans - > num_joined = 0 ;
init_waitqueue_head ( & cur_trans - > writer_wait ) ;
init_waitqueue_head ( & cur_trans - > commit_wait ) ;
cur_trans - > in_commit = 0 ;
cur_trans - > blocked = 0 ;
/*
* One for this trans handle , one so it will live on until we
* commit the transaction .
*/
atomic_set ( & cur_trans - > use_count , 2 ) ;
cur_trans - > commit_done = 0 ;
cur_trans - > start_time = get_seconds ( ) ;
cur_trans - > delayed_refs . root = RB_ROOT ;
cur_trans - > delayed_refs . num_entries = 0 ;
cur_trans - > delayed_refs . num_heads_ready = 0 ;
cur_trans - > delayed_refs . num_heads = 0 ;
cur_trans - > delayed_refs . flushing = 0 ;
cur_trans - > delayed_refs . run_delayed_start = 0 ;
2011-09-14 12:37:00 +02:00
cur_trans - > delayed_refs . seq = 1 ;
2011-12-12 16:10:07 +01:00
init_waitqueue_head ( & cur_trans - > delayed_refs . seq_wait ) ;
2011-04-11 17:25:13 -04:00
spin_lock_init ( & cur_trans - > commit_lock ) ;
spin_lock_init ( & cur_trans - > delayed_refs . lock ) ;
2011-09-14 12:37:00 +02:00
INIT_LIST_HEAD ( & cur_trans - > delayed_refs . seq_head ) ;
2011-04-11 17:25:13 -04:00
INIT_LIST_HEAD ( & cur_trans - > pending_snapshots ) ;
list_add_tail ( & cur_trans - > list , & root - > fs_info - > trans_list ) ;
extent_io_tree_init ( & cur_trans - > dirty_pages ,
2011-05-28 07:00:39 -04:00
root - > fs_info - > btree_inode - > i_mapping ) ;
2011-04-11 17:25:13 -04:00
root - > fs_info - > generation + + ;
cur_trans - > transid = root - > fs_info - > generation ;
root - > fs_info - > running_transaction = cur_trans ;
spin_unlock ( & root - > fs_info - > trans_lock ) ;
2007-08-10 16:22:09 -04:00
2007-03-22 15:59:16 -04:00
return 0 ;
}
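join_transaction ( ) drops the spinlock before allocating, then retakes it and rechecks whether another thread installed a transaction in the meantime. A minimal userspace sketch of that unlock, allocate, relock, recheck shape follows. All names here (struct fs_state, fs_init, join_or_create) are hypothetical, an atomic_flag spin loop stands in for trans_lock, and plain ints replace the kernel's atomic counters because they are only touched under the lock in this sketch.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical, much reduced stand-ins for the btrfs structures. */
struct txn {
	int num_writers;
	int use_count;
	unsigned long transid;
};

struct fs_state {
	atomic_flag lock;          /* stand-in for fs_info->trans_lock */
	int no_join;               /* stand-in for trans_no_join */
	unsigned long generation;
	struct txn *running;       /* stand-in for running_transaction */
};

void fs_init(struct fs_state *fs)
{
	atomic_flag_clear(&fs->lock);
	fs->no_join = 0;
	fs->generation = 0;
	fs->running = NULL;
}

static void fs_lock(struct fs_state *fs)
{
	while (atomic_flag_test_and_set_explicit(&fs->lock, memory_order_acquire))
		;	/* spin */
}

static void fs_unlock(struct fs_state *fs)
{
	atomic_flag_clear_explicit(&fs->lock, memory_order_release);
}

/* Join the running transaction, or create one if none exists. */
int join_or_create(struct fs_state *fs, int nofail)
{
	struct txn *t;

	fs_lock(fs);
loop:
	if (fs->no_join && !nofail) {
		fs_unlock(fs);
		return -EBUSY;
	}
	t = fs->running;
	if (t) {
		t->use_count++;
		t->num_writers++;
		fs_unlock(fs);
		return 0;
	}
	fs_unlock(fs);

	/* Allocation may block, so it happens with the lock dropped. */
	t = malloc(sizeof(*t));
	if (!t)
		return -ENOMEM;

	fs_lock(fs);
	if (fs->running) {
		/* Lost the race: discard ours and redo the checks above. */
		free(t);
		goto loop;
	}
	t->num_writers = 1;
	t->use_count = 2;	/* one for the caller, one held until commit */
	t->transid = ++fs->generation;
	fs->running = t;
	fs_unlock(fs);
	return 0;
}
```

The jump back to loop happens with the lock held, which matches the state the label expects on first entry.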
2008-09-29 15:18:18 -04:00
/*
2009-01-05 21:25:51 -05:00
* this does all the record keeping required to make sure that a reference
* counted root is properly recorded in a given transaction . This is required
* to make sure the old root from before we joined the transaction is deleted
* when the transaction commits
2008-09-29 15:18:18 -04:00
*/
2011-06-13 20:00:16 -04:00
static int record_root_in_trans ( struct btrfs_trans_handle * trans ,
2011-04-11 17:25:13 -04:00
struct btrfs_root * root )
2007-08-07 16:15:09 -04:00
{
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
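The copy-on-write rule described in this message (free a non-shared block at once; for a shared block, bump the children and drop one reference on the old block) can be modelled with a toy in-memory structure. This is a sketch under heavy assumptions: struct extent, struct tree_block and cow_block ( ) are hypothetical, and real btrfs tracks these counts in the extent tree rather than in the blocks themselves.

```c
#include <stdlib.h>

#define MAX_PTRS 4

/* Hypothetical toy model: an extent that tree blocks point to. */
struct extent {
	int refs;
};

/* Hypothetical toy model of a tree block. */
struct tree_block {
	int refs;
	int nr_ptrs;
	struct extent *ptrs[MAX_PTRS];
};

/*
 * Copy-on-write a block. Returns the new copy, or NULL on allocation
 * failure. *old_freed is set to 1 if the old block was released.
 */
struct tree_block *cow_block(struct tree_block *old, int *old_freed)
{
	struct tree_block *copy = malloc(sizeof(*copy));
	int i;

	if (!copy)
		return NULL;

	*copy = *old;		/* same pointers as the old block */
	copy->refs = 1;

	if (old->refs == 1) {
		/*
		 * Non-shared: the copy inherits the old block's references,
		 * so the extents' counts stay untouched and the old block
		 * can be freed immediately.
		 */
		free(old);
		*old_freed = 1;
	} else {
		/*
		 * Shared: every extent gains a referencing block, and the
		 * old block loses the reference that moved to the copy.
		 */
		for (i = 0; i < copy->nr_ptrs; i++)
			copy->ptrs[i]->refs++;
		old->refs--;
		*old_freed = 0;
	}
	return copy;
}
```

In the non-shared case no extent count changes at all, which is the saving the message describes.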
2009-06-10 10:45:14 -04:00
if ( root - > ref_cows & & root - > last_trans < trans - > transid ) {
2007-08-07 16:15:09 -04:00
WARN_ON ( root = = root - > fs_info - > extent_root ) ;
2009-06-10 10:45:14 -04:00
WARN_ON ( root - > commit_root ! = root - > node ) ;
2011-06-13 20:00:16 -04:00
/*
* see below for in_trans_setup usage rules
* we have the reloc mutex held now , so there
* is only one writer in this function
*/
root - > in_trans_setup = 1 ;
/* make sure readers find in_trans_setup before
* they find our root - > last_trans update
*/
smp_wmb ( ) ;
2011-04-11 17:25:13 -04:00
spin_lock ( & root - > fs_info - > fs_roots_radix_lock ) ;
if ( root - > last_trans = = trans - > transid ) {
spin_unlock ( & root - > fs_info - > fs_roots_radix_lock ) ;
return 0 ;
}
2009-06-10 10:45:14 -04:00
radix_tree_tag_set ( & root - > fs_info - > fs_roots_radix ,
( unsigned long ) root - > root_key . objectid ,
BTRFS_ROOT_TRANS_TAG ) ;
2011-04-11 17:25:13 -04:00
spin_unlock ( & root - > fs_info - > fs_roots_radix_lock ) ;
2011-06-13 20:00:16 -04:00
root - > last_trans = trans - > transid ;
/* this is pretty tricky. We don't want to
* take the relocation lock in btrfs_record_root_in_trans
* unless we ' re really doing the first setup for this root in
* this transaction .
*
* Normally we ' d use root - > last_trans as a flag to decide
* if we want to take the expensive mutex .
*
* But , we have to set root - > last_trans before we
* init the relocation root , otherwise , we trip over warnings
* in ctree . c . The solution used here is to flag ourselves
* with root - > in_trans_setup . When this is 1 , we ' re still
* fixing up the reloc trees and everyone must wait .
*
* When this is zero , they can trust root - > last_trans and fly
* through btrfs_record_root_in_trans without having to take the
* lock . smp_wmb ( ) makes sure that all the writes above are
* done before we pop in the zero below
*/
2009-06-10 10:45:14 -04:00
btrfs_init_reloc_root ( trans , root ) ;
2011-06-13 20:00:16 -04:00
smp_wmb ( ) ;
root - > in_trans_setup = 0 ;
2009-06-10 10:45:14 -04:00
}
return 0 ;
}
2008-07-30 16:29:20 -04:00
2011-06-13 20:00:16 -04:00
int btrfs_record_root_in_trans ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root )
{
if ( ! root - > ref_cows )
return 0 ;
/*
* see record_root_in_trans for comments about in_trans_setup usage
* and barriers
*/
smp_rmb ( ) ;
if ( root - > last_trans = = trans - > transid & &
! root - > in_trans_setup )
return 0 ;
mutex_lock ( & root - > fs_info - > reloc_mutex ) ;
record_root_in_trans ( trans , root ) ;
mutex_unlock ( & root - > fs_info - > reloc_mutex ) ;
return 0 ;
}
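The in_trans_setup protocol lets callers skip the mutex once a root has been recorded for the current transaction. A userspace approximation using C11 fences is sketched below. It is an assumption-laden illustration: struct root_state, root_init, record_slow and record_root are hypothetical names, an atomic_flag spin loop stands in for reloc_mutex, release and acquire fences approximate smp_wmb ( ) and smp_rmb ( ), and the fence placement is this sketch's own rather than a copy of the kernel's.

```c
#include <stdatomic.h>

struct root_state {
	atomic_flag setup_lock;   /* stand-in for fs_info->reloc_mutex */
	atomic_ulong last_trans;
	atomic_int in_setup;
	int setup_runs;           /* counts slow-path setups, for illustration */
};

void root_init(struct root_state *r)
{
	atomic_flag_clear(&r->setup_lock);
	atomic_init(&r->last_trans, 0);
	atomic_init(&r->in_setup, 0);
	r->setup_runs = 0;
}

/* Slow path; the caller holds setup_lock. */
static void record_slow(struct root_state *r, unsigned long transid)
{
	if (atomic_load_explicit(&r->last_trans, memory_order_relaxed) >= transid)
		return;

	atomic_store_explicit(&r->in_setup, 1, memory_order_relaxed);
	/* Readers must see in_setup == 1 before the new last_trans. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&r->last_trans, transid, memory_order_relaxed);

	r->setup_runs++;	/* stands in for the expensive per-root setup */

	/* All setup writes must be visible before in_setup drops to 0. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&r->in_setup, 0, memory_order_relaxed);
}

/* Returns 0 if the lock-free fast path was taken, 1 otherwise. */
int record_root(struct root_state *r, unsigned long transid)
{
	unsigned long last;

	last = atomic_load_explicit(&r->last_trans, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);
	if (last == transid &&
	    !atomic_load_explicit(&r->in_setup, memory_order_relaxed))
		return 0;

	while (atomic_flag_test_and_set_explicit(&r->setup_lock,
						 memory_order_acquire))
		;	/* spin; the kernel code sleeps on a mutex instead */
	record_slow(r, transid);
	atomic_flag_clear_explicit(&r->setup_lock, memory_order_release);
	return 1;
}
```

A reader that sees the new last_trans is guaranteed to also see in_setup set, so it can only take the fast path after the setup has finished.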
2008-09-29 15:18:18 -04:00
/* wait for commit against the current transaction to become unblocked
* when this is done , it is safe to start a new transaction , but the current
* transaction might not be fully on disk .
*/
2008-07-31 10:48:37 -04:00
static void wait_current_trans ( struct btrfs_root * root )
2007-03-22 15:59:16 -04:00
{
2008-07-17 12:54:14 -04:00
struct btrfs_transaction * cur_trans ;
2007-03-22 15:59:16 -04:00
2011-04-11 17:25:13 -04:00
spin_lock ( & root - > fs_info - > trans_lock ) ;
2008-07-17 12:54:14 -04:00
cur_trans = root - > fs_info - > running_transaction ;
2008-07-31 10:48:37 -04:00
if ( cur_trans & & cur_trans - > blocked ) {
2011-04-11 15:45:29 -04:00
atomic_inc ( & cur_trans - > use_count ) ;
2011-04-11 17:25:13 -04:00
spin_unlock ( & root - > fs_info - > trans_lock ) ;
2011-07-14 03:17:00 +00:00
wait_event ( root - > fs_info - > transaction_wait ,
! cur_trans - > blocked ) ;
2008-07-17 12:54:14 -04:00
put_transaction ( cur_trans ) ;
2011-04-11 17:25:13 -04:00
} else {
spin_unlock ( & root - > fs_info - > trans_lock ) ;
2008-07-17 12:54:14 -04:00
}
2008-07-31 10:48:37 -04:00
}
2009-11-10 21:23:48 -05:00
enum btrfs_trans_type {
TRANS_START ,
TRANS_JOIN ,
TRANS_USERSPACE ,
2010-06-21 14:48:16 -04:00
TRANS_JOIN_NOLOCK ,
2009-11-10 21:23:48 -05:00
} ;
2010-05-16 10:48:46 -04:00
static int may_wait_transaction ( struct btrfs_root * root , int type )
{
2011-04-11 17:25:13 -04:00
if ( root - > fs_info - > log_root_recovering )
return 0 ;
if ( type = = TRANS_USERSPACE )
return 1 ;
if ( type = = TRANS_START & &
! atomic_read ( & root - > fs_info - > open_ioctl_trans ) )
2010-05-16 10:48:46 -04:00
return 1 ;
2011-04-11 17:25:13 -04:00
2010-05-16 10:48:46 -04:00
return 0 ;
}
2008-09-05 16:13:11 -04:00
static struct btrfs_trans_handle * start_transaction ( struct btrfs_root * root ,
2010-05-16 10:48:46 -04:00
u64 num_items , int type )
2008-07-31 10:48:37 -04:00
{
2010-05-16 10:48:46 -04:00
struct btrfs_trans_handle * h ;
struct btrfs_transaction * cur_trans ;
2011-06-07 15:07:51 -04:00
u64 num_bytes = 0 ;
2008-07-31 10:48:37 -04:00
int ret ;
2011-01-06 19:30:25 +08:00
if ( root - > fs_info - > fs_state & BTRFS_SUPER_FLAG_ERROR )
return ERR_PTR ( - EROFS ) ;
2011-04-13 15:15:59 -04:00
if ( current - > journal_info ) {
WARN_ON ( type ! = TRANS_JOIN & & type ! = TRANS_JOIN_NOLOCK ) ;
h = current - > journal_info ;
h - > use_count + + ;
h - > orig_rsv = h - > block_rsv ;
h - > block_rsv = NULL ;
goto got_it ;
}
2011-06-07 15:07:51 -04:00
/*
* Do the reservation before we join the transaction so we can do all
* the appropriate flushing if need be .
*/
if ( num_items > 0 & & root ! = root - > fs_info - > chunk_root ) {
num_bytes = btrfs_calc_trans_metadata_size ( root , num_items ) ;
2011-08-30 12:34:28 -04:00
ret = btrfs_block_rsv_add ( root ,
2011-06-07 15:07:51 -04:00
& root - > fs_info - > trans_block_rsv ,
num_bytes ) ;
if ( ret )
return ERR_PTR ( ret ) ;
}
2010-05-16 10:48:46 -04:00
again :
h = kmem_cache_alloc ( btrfs_trans_handle_cachep , GFP_NOFS ) ;
if ( ! h )
return ERR_PTR ( - ENOMEM ) ;
2008-07-31 10:48:37 -04:00
2010-05-16 10:48:46 -04:00
if ( may_wait_transaction ( root , type ) )
2008-07-31 10:48:37 -04:00
wait_current_trans ( root ) ;
2010-05-16 10:48:46 -04:00
2011-04-11 17:25:13 -04:00
do {
ret = join_transaction ( root , type = = TRANS_JOIN_NOLOCK ) ;
if ( ret = = - EBUSY )
wait_current_trans ( root ) ;
} while ( ret = = - EBUSY ) ;
2011-03-23 08:14:16 +00:00
if ( ret < 0 ) {
2011-04-03 12:31:28 +00:00
kmem_cache_free ( btrfs_trans_handle_cachep , h ) ;
2011-03-23 08:14:16 +00:00
return ERR_PTR ( ret ) ;
}
2007-04-09 10:42:37 -04:00
2010-05-16 10:48:46 -04:00
cur_trans = root - > fs_info - > running_transaction ;
h - > transid = cur_trans - > transid ;
h - > transaction = cur_trans ;
2007-03-22 15:59:16 -04:00
h - > blocks_used = 0 ;
2010-05-16 10:48:46 -04:00
h - > bytes_reserved = 0 ;
2009-03-13 10:10:06 -04:00
h - > delayed_ref_updates = 0 ;
2011-04-13 15:15:59 -04:00
h - > use_count = 1 ;
2010-05-16 10:46:25 -04:00
h - > block_rsv = NULL ;
2011-04-13 15:15:59 -04:00
h - > orig_rsv = NULL ;
2009-03-12 20:12:45 -04:00
2010-05-16 10:48:46 -04:00
smp_mb ( ) ;
if ( cur_trans - > blocked & & may_wait_transaction ( root , type ) ) {
btrfs_commit_transaction ( h , root ) ;
goto again ;
}
2011-06-07 15:07:51 -04:00
if ( num_bytes ) {
2012-01-10 10:31:31 -05:00
trace_btrfs_space_reservation ( root - > fs_info , " transaction " ,
2012-02-24 10:39:05 -05:00
( u64 ) ( unsigned long ) h ,
num_bytes , 1 ) ;
2011-06-07 15:07:51 -04:00
h - > block_rsv = & root - > fs_info - > trans_block_rsv ;
h - > bytes_reserved = num_bytes ;
2010-05-16 10:48:46 -04:00
}
2009-09-11 16:12:44 -04:00
2011-04-13 15:15:59 -04:00
got_it :
2011-04-11 17:25:13 -04:00
btrfs_record_root_in_trans ( h , root ) ;
2010-05-16 10:48:46 -04:00
if ( ! current - > journal_info & & type ! = TRANS_USERSPACE )
current - > journal_info = h ;
2007-03-22 15:59:16 -04:00
return h ;
}
2008-07-17 12:54:14 -04:00
struct btrfs_trans_handle * btrfs_start_transaction ( struct btrfs_root * root ,
2010-05-16 10:48:46 -04:00
int num_items )
2008-07-17 12:54:14 -04:00
{
2010-05-16 10:48:46 -04:00
return start_transaction ( root , num_items , TRANS_START ) ;
2008-07-17 12:54:14 -04:00
}
2011-04-13 12:54:33 -04:00
struct btrfs_trans_handle * btrfs_join_transaction ( struct btrfs_root * root )
2008-07-17 12:54:14 -04:00
{
2010-05-16 10:48:46 -04:00
return start_transaction ( root , 0 , TRANS_JOIN ) ;
2008-07-17 12:54:14 -04:00
}
2011-04-13 12:54:33 -04:00
struct btrfs_trans_handle * btrfs_join_transaction_nolock ( struct btrfs_root * root )
2010-06-21 14:48:16 -04:00
{
return start_transaction ( root , 0 , TRANS_JOIN_NOLOCK ) ;
}
2011-04-13 12:54:33 -04:00
struct btrfs_trans_handle * btrfs_start_ioctl_transaction ( struct btrfs_root * root )
2008-08-04 10:41:27 -04:00
{
2011-04-13 12:54:33 -04:00
return start_transaction ( root , 0 , TRANS_USERSPACE ) ;
2008-08-04 10:41:27 -04:00
}
2008-09-29 15:18:18 -04:00
/* wait for a transaction commit to be fully complete */
2011-07-14 03:17:14 +00:00
static noinline void wait_for_commit ( struct btrfs_root * root ,
2008-06-25 16:01:31 -04:00
struct btrfs_transaction * commit )
{
2011-07-14 03:17:00 +00:00
wait_event ( commit - > commit_wait , commit - > commit_done ) ;
2008-06-25 16:01:31 -04:00
}
Btrfs: add START_SYNC, WAIT_SYNC ioctls
START_SYNC will start a sync/commit, but not wait for it to
complete. Any modification started after the ioctl returns is
guaranteed not to be included in the commit. If a non-NULL
pointer is passed, the transaction id will be returned to
userspace.
WAIT_SYNC will wait for any in-progress commit to complete. If a
transaction id is specified, the ioctl will block and then
return (success) when the specified transaction has committed.
If it has already committed when we call the ioctl, it returns
immediately. If the specified transaction doesn't exist, it
returns EINVAL.
If no transaction id is specified, WAIT_SYNC will wait for the
currently committing transaction to finish its commit to disk.
If there is no currently committing transaction, it returns
success.
These ioctls are useful for applications which want to impose an
ordering on when fs modifications reach disk, but do not want to
wait for the full (slow) commit process to do so.
Picky callers can take the transid returned by START_SYNC and
feed it to WAIT_SYNC, and be certain to wait only as long as
necessary for the transaction _they_ started to reach disk.
Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
and provided they didn't wait too long between the calls, they
will get the same result. However, if a second commit starts
before they call WAIT_SYNC, they may end up waiting longer for
it to commit as well. Even so, a START_SYNC+WAIT_SYNC still
guarantees that any operation completed before the START_SYNC
reaches disk.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:41:32 -04:00
int btrfs_wait_for_commit ( struct btrfs_root * root , u64 transid )
{
struct btrfs_transaction * cur_trans = NULL , * t ;
int ret ;
ret = 0 ;
if ( transid ) {
if ( transid < = root - > fs_info - > last_trans_committed )
2011-04-11 17:25:13 -04:00
goto out ;
2010-10-29 15:41:32 -04:00
/* find specified transaction */
2011-04-11 17:25:13 -04:00
spin_lock ( & root - > fs_info - > trans_lock ) ;
2010-10-29 15:41:32 -04:00
list_for_each_entry ( t , & root - > fs_info - > trans_list , list ) {
if ( t - > transid = = transid ) {
cur_trans = t ;
2011-04-11 17:25:13 -04:00
atomic_inc ( & cur_trans - > use_count ) ;
2010-10-29 15:41:32 -04:00
break ;
}
if ( t - > transid > transid )
break ;
}
2011-04-11 17:25:13 -04:00
spin_unlock ( & root - > fs_info - > trans_lock ) ;
2010-10-29 15:41:32 -04:00
ret = - EINVAL ;
if ( ! cur_trans )
2011-04-11 17:25:13 -04:00
goto out ; /* bad transid */
2010-10-29 15:41:32 -04:00
} else {
/* find newest transaction that is committing | committed */
2011-04-11 17:25:13 -04:00
spin_lock ( & root - > fs_info - > trans_lock ) ;
2010-10-29 15:41:32 -04:00
list_for_each_entry_reverse ( t , & root - > fs_info - > trans_list ,
list ) {
if ( t - > in_commit ) {
if ( t - > commit_done )
2011-06-09 10:15:17 -04:00
break ;
2010-10-29 15:41:32 -04:00
cur_trans = t ;
2011-04-11 17:25:13 -04:00
atomic_inc ( & cur_trans - > use_count ) ;
2010-10-29 15:41:32 -04:00
break ;
}
}
2011-04-11 17:25:13 -04:00
spin_unlock ( & root - > fs_info - > trans_lock ) ;
2010-10-29 15:41:32 -04:00
if ( ! cur_trans )
2011-04-11 17:25:13 -04:00
goto out ; /* nothing committing|committed */
2010-10-29 15:41:32 -04:00
}
wait_for_commit ( root , cur_trans ) ;
put_transaction ( cur_trans ) ;
ret = 0 ;
2011-04-11 17:25:13 -04:00
out :
2010-10-29 15:41:32 -04:00
return ret ;
}
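The lookup performed by btrfs_wait_for_commit can be sketched as a small userspace model. This is only an illustration of the decision logic, not kernel code: `struct model_trans`, `wait_sync_decide` and the plain array standing in for `fs_info->trans_list` are hypothetical names, and locking, reference counting and the actual wait are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of the transaction list. */
struct model_trans {
	unsigned long long transid;
	int in_commit;   /* commit has started */
	int commit_done; /* commit is fully on disk */
};

/*
 * Decide which transaction a WAIT_SYNC-style call would wait on.
 * 'list' is ordered oldest to newest. On success returns 0 and stores
 * the index to wait on in *target, or -1 if no wait is needed.
 * Returns -EINVAL when a non-zero transid is neither committed nor listed.
 */
int wait_sync_decide(const struct model_trans *list, size_t n,
		     unsigned long long transid,
		     unsigned long long last_committed, long *target)
{
	size_t i;

	assert(target != NULL);
	*target = -1;

	if (transid) {
		/* Already committed: return immediately. */
		if (transid <= last_committed)
			return 0;
		/* Find the specified transaction, oldest first. */
		for (i = 0; i < n; i++) {
			if (list[i].transid == transid) {
				*target = (long)i;
				return 0;
			}
			if (list[i].transid > transid)
				break;
		}
		return -EINVAL; /* bad transid */
	}

	/* No transid: newest transaction that is committing or committed. */
	for (i = n; i-- > 0; ) {
		if (list[i].in_commit) {
			if (!list[i].commit_done)
				*target = (long)i;
			break;
		}
	}
	return 0;
}
```

In this model a caller that passes the transid it got back from a START_SYNC-style call waits only on that entry, while a caller that passes 0 waits on whichever commit happens to be newest and still in flight.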
2008-07-31 10:48:37 -04:00
void btrfs_throttle ( struct btrfs_root * root )
{
2011-04-11 17:25:13 -04:00
if ( ! atomic_read ( & root - > fs_info - > open_ioctl_trans ) )
2008-08-04 10:41:27 -04:00
wait_current_trans ( root ) ;
2008-07-31 10:48:37 -04:00
}
2010-05-16 10:49:58 -04:00
static int should_end_transaction ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root )
{
int ret ;
2011-10-18 12:15:48 -04:00
ret = btrfs_block_rsv_check ( root , & root - > fs_info - > global_block_rsv , 5 ) ;
2010-05-16 10:49:58 -04:00
return ret ? 1 : 0 ;
}
int btrfs_should_end_transaction ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root )
{
struct btrfs_transaction * cur_trans = trans - > transaction ;
2011-09-19 11:58:54 -04:00
struct btrfs_block_rsv * rsv = trans - > block_rsv ;
2010-05-16 10:49:58 -04:00
int updates ;
2011-04-11 17:25:13 -04:00
smp_mb ( ) ;
2010-05-16 10:49:58 -04:00
if ( cur_trans - > blocked | | cur_trans - > delayed_refs . flushing )
return 1 ;
2011-09-19 11:58:54 -04:00
/*
* We need to do this in case we ' re deleting csums so the global block
* rsv gets used instead of the csum block rsv .
*/
trans - > block_rsv = NULL ;
2010-05-16 10:49:58 -04:00
updates = trans - > delayed_ref_updates ;
trans - > delayed_ref_updates = 0 ;
if ( updates )
btrfs_run_delayed_refs ( trans , root , updates ) ;
2011-09-19 11:58:54 -04:00
trans - > block_rsv = rsv ;
2010-05-16 10:49:58 -04:00
return should_end_transaction ( trans , root ) ;
}
2008-06-25 16:01:31 -04:00
static int __btrfs_end_transaction ( struct btrfs_trans_handle * trans ,
2010-06-21 14:48:16 -04:00
struct btrfs_root * root , int throttle , int lock )
2007-03-22 15:59:16 -04:00
{
2010-05-16 10:49:58 -04:00
struct btrfs_transaction * cur_trans = trans - > transaction ;
2008-07-29 16:15:18 -04:00
struct btrfs_fs_info * info = root - > fs_info ;
2009-03-13 10:17:05 -04:00
int count = 0 ;
2011-04-13 15:15:59 -04:00
if ( - - trans - > use_count ) {
trans - > block_rsv = trans - > orig_rsv ;
return 0 ;
}
2011-10-14 14:40:17 -04:00
btrfs_trans_release_metadata ( trans , root ) ;
2011-08-30 11:31:29 -04:00
trans - > block_rsv = NULL ;
2012-01-06 15:23:57 -05:00
while ( count < 2 ) {
2009-03-13 10:17:05 -04:00
unsigned long cur = trans - > delayed_ref_updates ;
trans - > delayed_ref_updates = 0 ;
if ( cur & &
trans - > transaction - > delayed_refs . num_heads_ready > 64 ) {
trans - > delayed_ref_updates = 0 ;
btrfs_run_delayed_refs ( trans , root , cur ) ;
} else {
break ;
}
count + + ;
2009-03-13 10:10:06 -04:00
}
2011-04-11 17:25:13 -04:00
if ( lock & & ! atomic_read ( & root - > fs_info - > open_ioctl_trans ) & &
should_end_transaction ( trans , root ) ) {
2010-05-16 10:49:58 -04:00
trans - > transaction - > blocked = 1 ;
2011-04-11 17:25:13 -04:00
smp_wmb ( ) ;
}
2010-05-16 10:49:58 -04:00
2010-06-21 14:48:16 -04:00
if ( lock & & cur_trans - > blocked & & ! cur_trans - > in_commit ) {
2011-07-24 15:45:34 -04:00
if ( throttle ) {
/*
* We may race with somebody else here so end up having
* to call end_transaction on ourselves again , so inc
* our use_count .
*/
trans - > use_count + + ;
2010-05-16 10:49:58 -04:00
return btrfs_commit_transaction ( trans , root ) ;
2011-07-24 15:45:34 -04:00
} else {
2010-05-16 10:49:58 -04:00
wake_up_process ( info - > transaction_kthread ) ;
2011-07-24 15:45:34 -04:00
}
2010-05-16 10:49:58 -04:00
}
WARN_ON ( cur_trans ! = info - > running_transaction ) ;
2011-04-11 15:45:29 -04:00
WARN_ON ( atomic_read ( & cur_trans - > num_writers ) < 1 ) ;
atomic_dec ( & cur_trans - > num_writers ) ;
2008-06-25 16:01:31 -04:00
2010-10-29 15:37:34 -04:00
smp_mb ( ) ;
2007-03-22 15:59:16 -04:00
if ( waitqueue_active ( & cur_trans - > writer_wait ) )
wake_up ( & cur_trans - > writer_wait ) ;
put_transaction ( cur_trans ) ;
2009-09-11 16:12:44 -04:00
if ( current - > journal_info = = trans )
current - > journal_info = NULL ;
2007-03-30 14:27:56 -04:00
memset ( trans , 0 , sizeof ( * trans ) ) ;
2007-04-02 10:50:19 -04:00
kmem_cache_free ( btrfs_trans_handle_cachep , trans ) ;
2008-07-29 16:15:18 -04:00
2009-11-12 09:36:34 +00:00
if ( throttle )
btrfs_run_delayed_iputs ( root ) ;
2007-03-22 15:59:16 -04:00
return 0 ;
}
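The nested-handle bookkeeping shared by start_transaction and __btrfs_end_transaction can be sketched in isolation. This is a simplified userspace model under stated assumptions: `struct model_handle`, `model_start`, `model_end` and the global `model_journal_info` (standing in for `current->journal_info`) are hypothetical, reservations are reduced to an integer, and no real transaction is joined or committed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced version of a transaction handle. */
struct model_handle {
	int use_count; /* how many nested starts share this handle */
	int block_rsv; /* stand-in for the active reservation */
	int orig_rsv;  /* reservation saved while a nested user runs */
};

/* Stand-in for current->journal_info (one task only in this model). */
struct model_handle *model_journal_info;
struct model_handle model_storage;

/* Start or join: a nested start reuses the task's existing handle. */
struct model_handle *model_start(int rsv)
{
	struct model_handle *h = model_journal_info;

	if (h) {
		h->use_count++;
		h->orig_rsv = h->block_rsv; /* remember the outer reservation */
		h->block_rsv = 0;           /* nested user starts without one */
		return h;
	}

	h = &model_storage;
	h->use_count = 1;
	h->block_rsv = rsv;
	h->orig_rsv = 0;
	model_journal_info = h;
	return h;
}

/* Returns 1 when the handle is really released, 0 for a nested end. */
int model_end(struct model_handle *h)
{
	assert(h != NULL && h->use_count > 0);

	if (--h->use_count) {
		h->block_rsv = h->orig_rsv; /* restore the outer reservation */
		return 0;
	}
	if (model_journal_info == h)
		model_journal_info = NULL;
	return 1;
}
```

Only the outermost end tears the handle down; inner ends just drop a reference and restore the saved reservation, which mirrors the early return on a non-zero use_count in the function above.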
2008-06-25 16:01:31 -04:00
int btrfs_end_transaction ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root )
{
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so this patch implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which is waiting to be dealt with
by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the number of delayed items exceeds the lower limit, we create work items
for some delayed nodes, insert them into the worker's work queue, and then
return.
When the number of delayed items exceeds the upper bound, we create work items
for all the delayed nodes that haven't been dealt with, insert them into the
worker's work queue, and then wait until the number of untreated items drops
below some threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
in the inserting rb-tree at first. If we look it up, just drop it. If not,
add the key of it into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
inode into the delayed node. the worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node, By this way, we can cache more delayed items and merge more
inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
return __btrfs_end_transaction ( trans , root , 0 , 1 ) ;
2008-06-25 16:01:31 -04:00
}
int btrfs_end_transaction_throttle ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root )
{
2011-04-22 18:12:22 +08:00
return __btrfs_end_transaction ( trans , root , 1 , 1 ) ;
2010-06-21 14:48:16 -04:00
}
int btrfs_end_transaction_nolock ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root )
{
2011-04-22 18:12:22 +08:00
return __btrfs_end_transaction ( trans , root , 0 , 0 ) ;
}
int btrfs_end_transaction_dmeta ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root )
{
return __btrfs_end_transaction ( trans , root , 1 , 1 ) ;
2008-06-25 16:01:31 -04:00
}
2008-09-29 15:18:18 -04:00
/*
* when btree blocks are allocated , they have some corresponding bits set for
* them in one of two extent_io trees . This is used to make sure all of
2009-10-13 13:29:19 -04:00
* those extents are sent to disk but does not wait on them
2008-09-29 15:18:18 -04:00
*/
2009-10-13 13:29:19 -04:00
int btrfs_write_marked_extents ( struct btrfs_root * root ,
2009-11-12 09:33:26 +00:00
struct extent_io_tree * dirty_pages , int mark )
2007-03-22 15:59:16 -04:00
{
2008-08-15 15:34:15 -04:00
int err = 0 ;
2007-04-28 09:29:35 -04:00
int werr = 0 ;
2011-09-26 13:58:47 -04:00
struct address_space * mapping = root - > fs_info - > btree_inode - > i_mapping ;
2008-08-15 15:34:15 -04:00
u64 start = 0 ;
2007-10-15 16:14:19 -04:00
u64 end ;
2007-04-28 09:29:35 -04:00
2011-09-26 13:58:47 -04:00
while ( ! find_first_extent_bit ( dirty_pages , start , & start , & end ,
mark ) ) {
convert_extent_bit ( dirty_pages , start , end , EXTENT_NEED_WAIT , mark ,
GFP_NOFS ) ;
err = filemap_fdatawrite_range ( mapping , start , end ) ;
if ( err )
werr = err ;
cond_resched ( ) ;
start = end + 1 ;
2007-04-28 09:29:35 -04:00
}
2009-10-13 13:29:19 -04:00
if ( err )
werr = err ;
return werr ;
}
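/*
 * A minimal user-space sketch of the write phase above, assuming a toy
 * flag array in place of an extent_io_tree. Every sketch_* name is
 * hypothetical and not part of btrfs. It only illustrates the idea of
 * swapping the caller's mark for a "need wait" bit on each marked range
 * before starting writeback, so that a later pass knows what to wait on.
 */

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the extent state bits used above. */
#define SKETCH_DIRTY     0x1u
#define SKETCH_NEED_WAIT 0x2u

#define SKETCH_NR_BLOCKS 8

/* One flag word per block; a toy replacement for an extent_io_tree. */
struct sketch_tree {
	unsigned int bits[SKETCH_NR_BLOCKS];
};

/*
 * Find the first run of consecutive blocks at or after 'start' that
 * have 'mark' set. Returns 0 and fills the inclusive range on success,
 * 1 when nothing is found (the same convention the loop above relies on).
 */
static int sketch_find_first(const struct sketch_tree *t, size_t start,
			     size_t *ret_start, size_t *ret_end,
			     unsigned int mark)
{
	size_t i = start;

	while (i < SKETCH_NR_BLOCKS && !(t->bits[i] & mark))
		i++;
	if (i >= SKETCH_NR_BLOCKS)
		return 1;
	*ret_start = i;
	while (i + 1 < SKETCH_NR_BLOCKS && (t->bits[i + 1] & mark))
		i++;
	*ret_end = i;
	return 0;
}

/*
 * Write phase: replace 'mark' with SKETCH_NEED_WAIT on every marked
 * range and return how many ranges would have been submitted.
 */
static int sketch_write_marked(struct sketch_tree *t, unsigned int mark)
{
	size_t start = 0, end, i;
	int submitted = 0;

	assert(mark != 0);
	while (!sketch_find_first(t, start, &start, &end, mark)) {
		for (i = start; i <= end; i++) {
			t->bits[i] &= ~mark;
			t->bits[i] |= SKETCH_NEED_WAIT;
		}
		submitted++;	/* real code would start I/O on the range here */
		start = end + 1;
	}
	return submitted;
}
```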
/*
* when btree blocks are allocated , they have some corresponding bits set for
* them in one of two extent_io trees . This is used to make sure all of
* those extents are on disk for transaction or log commit . We wait
* on all the pages and clear them from the dirty pages state tree
*/
int btrfs_wait_marked_extents ( struct btrfs_root * root ,
2009-11-12 09:33:26 +00:00
struct extent_io_tree * dirty_pages , int mark )
2009-10-13 13:29:19 -04:00
{
int err = 0 ;
int werr = 0 ;
2011-09-26 13:58:47 -04:00
struct address_space * mapping = root - > fs_info - > btree_inode - > i_mapping ;
2009-10-13 13:29:19 -04:00
u64 start = 0 ;
u64 end ;
2008-08-15 15:34:15 -04:00
2011-09-26 13:58:47 -04:00
while ( ! find_first_extent_bit ( dirty_pages , start , & start , & end ,
EXTENT_NEED_WAIT ) ) {
clear_extent_bits ( dirty_pages , start , end , EXTENT_NEED_WAIT , GFP_NOFS ) ;
err = filemap_fdatawait_range ( mapping , start , end ) ;
if ( err )
werr = err ;
cond_resched ( ) ;
start = end + 1 ;
2008-08-15 15:34:15 -04:00
}
2007-04-28 09:29:35 -04:00
if ( err )
werr = err ;
return werr ;
2007-03-22 15:59:16 -04:00
}
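/*
 * A small sketch of the error handling used by the loop above, under
 * the assumption that each range yields either 0 or a negative errno.
 * sketch_wait_all and its parameters are hypothetical. The point it
 * illustrates: a failure on one range does not stop the walk, and the
 * most recent failure is what the caller eventually sees.
 */

```c
#include <assert.h>
#include <stddef.h>

/*
 * Visit every entry in 'results' (a stand-in for waiting on one range
 * each), remember the latest non-zero result, and report how many
 * entries were visited through 'visited'.
 */
static int sketch_wait_all(const int *results, size_t n, size_t *visited)
{
	int err;
	int werr = 0;
	size_t i;

	assert(visited != NULL);
	*visited = 0;
	for (i = 0; i < n; i++) {
		err = results[i];
		if (err)
			werr = err;	/* remember it, but keep going */
		(*visited)++;
	}
	return werr;
}
```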
2009-10-13 13:29:19 -04:00
/*
* when btree blocks are allocated , they have some corresponding bits set for
* them in one of two extent_io trees . This is used to make sure all of
* those extents are on disk for transaction or log commit
*/
int btrfs_write_and_wait_marked_extents ( struct btrfs_root * root ,
2009-11-12 09:33:26 +00:00
struct extent_io_tree * dirty_pages , int mark )
2009-10-13 13:29:19 -04:00
{
int ret ;
int ret2 ;
2009-11-12 09:33:26 +00:00
ret = btrfs_write_marked_extents ( root , dirty_pages , mark ) ;
ret2 = btrfs_wait_marked_extents ( root , dirty_pages , mark ) ;
2011-11-04 12:29:37 -04:00
if ( ret )
return ret ;
if ( ret2 )
return ret2 ;
return 0 ;
2009-10-13 13:29:19 -04:00
}
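/*
 * A sketch of how the function above combines its two phases, assuming
 * each phase returns 0 or a negative errno. The sketch_* names are
 * hypothetical. The wait phase runs even when the write phase failed,
 * and the write error takes precedence in the return value.
 */

```c
#include <assert.h>
#include <stddef.h>

/* Canned results for the two phases plus a counter of phases run. */
struct sketch_phases {
	int write_result;
	int wait_result;
	int calls;
};

static int sketch_write_phase(struct sketch_phases *p)
{
	p->calls++;
	return p->write_result;
}

static int sketch_wait_phase(struct sketch_phases *p)
{
	p->calls++;
	return p->wait_result;
}

/* Run both phases unconditionally; prefer the write error if both fail. */
static int sketch_write_and_wait(struct sketch_phases *p)
{
	int ret;
	int ret2;

	assert(p != NULL);
	ret = sketch_write_phase(p);
	ret2 = sketch_wait_phase(p);
	if (ret)
		return ret;
	return ret2;
}
```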
2008-09-11 16:17:57 -04:00
int btrfs_write_and_wait_transaction ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root )
{
if ( ! trans | | ! trans - > transaction ) {
struct inode * btree_inode ;
btree_inode = root - > fs_info - > btree_inode ;
return filemap_write_and_wait ( btree_inode - > i_mapping ) ;
}
return btrfs_write_and_wait_marked_extents ( root ,
2009-11-12 09:33:26 +00:00
& trans - > transaction - > dirty_pages ,
EXTENT_DIRTY ) ;
2008-09-11 16:17:57 -04:00
}
2008-09-29 15:18:18 -04:00
/*
* this is used to update the root pointer in the tree of tree roots .
*
* But , in the case of the extent allocation tree , updating the root
* pointer may allocate blocks which may change the root of the extent
* allocation tree .
*
* So , this loops and repeats and makes sure the cowonly root didn ' t
* change while the root pointer was being updated in the metadata .
*/
2008-03-24 15:01:56 -04:00
static int update_cowonly_root ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root )
2007-03-22 15:59:16 -04:00
{
int ret ;
2008-03-24 15:01:56 -04:00
u64 old_root_bytenr ;
2009-11-12 09:36:50 +00:00
u64 old_root_used ;
2008-03-24 15:01:56 -04:00
struct btrfs_root * tree_root = root - > fs_info - > tree_root ;
2007-03-22 15:59:16 -04:00
2009-11-12 09:36:50 +00:00
old_root_used = btrfs_root_used ( & root - > root_item ) ;
2008-03-24 15:01:56 -04:00
btrfs_write_dirty_block_groups ( trans , root ) ;
2009-03-13 10:10:06 -04:00
2009-01-05 21:25:51 -05:00
while ( 1 ) {
2008-03-24 15:01:56 -04:00
old_root_bytenr = btrfs_root_bytenr ( & root - > root_item ) ;
2009-11-12 09:36:50 +00:00
if ( old_root_bytenr = = root - > node - > start & &
old_root_used = = btrfs_root_used ( & root - > root_item ) )
2007-03-22 15:59:16 -04:00
break ;
2008-10-30 11:23:27 -04:00
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
btrfs_set_root_node ( & root - > root_item , root - > node ) ;
2007-03-22 15:59:16 -04:00
ret = btrfs_update_root ( trans , tree_root ,
2008-03-24 15:01:56 -04:00
& root - > root_key ,
& root - > root_item ) ;
2007-03-22 15:59:16 -04:00
BUG_ON ( ret ) ;
2009-03-13 10:10:06 -04:00
2009-11-12 09:36:50 +00:00
old_root_used = btrfs_root_used ( & root - > root_item ) ;
2009-07-22 10:07:05 -04:00
ret = btrfs_write_dirty_block_groups ( trans , root ) ;
2009-03-13 10:10:06 -04:00
BUG_ON ( ret ) ;
2008-03-24 15:01:56 -04:00
}
2009-07-30 09:40:40 -04:00
if ( root ! = root - > fs_info - > extent_root )
switch_commit_root ( root ) ;
2008-03-24 15:01:56 -04:00
return 0 ;
}
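/*
 * A toy model of the retry loop above. The assumption is that recording
 * the root's location can itself relocate the root a bounded number of
 * times, which is modeled here with a simple counter. The struct and
 * function names are hypothetical. The loop repeats until the recorded
 * location and the live location agree.
 */

```c
#include <assert.h>
#include <stddef.h>

struct sketch_cow_root {
	unsigned long node_start;	/* where the live root block is now */
	unsigned long recorded_start;	/* what the parent tree has on record */
	int pending_moves;		/* times recording will still move the root */
};

/* Record the current location; as a side effect the root may move. */
static void sketch_record_root(struct sketch_cow_root *r)
{
	r->recorded_start = r->node_start;
	if (r->pending_moves > 0) {
		r->pending_moves--;
		r->node_start += 4096;
	}
}

/* Repeat until the record is stable; return the number of passes made. */
static int sketch_update_root(struct sketch_cow_root *r)
{
	int passes = 0;

	assert(r != NULL);
	while (r->recorded_start != r->node_start) {
		sketch_record_root(r);
		passes++;
	}
	return passes;
}
```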
2008-09-29 15:18:18 -04:00
/*
* update all the cowonly tree roots on disk
*/
2009-06-10 10:45:14 -04:00
static noinline int commit_cowonly_roots ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root )
2008-03-24 15:01:56 -04:00
{
struct btrfs_fs_info * fs_info = root - > fs_info ;
struct list_head * next ;
2008-10-29 14:49:05 -04:00
struct extent_buffer * eb ;
2009-03-13 10:10:06 -04:00
int ret ;
2008-10-29 14:49:05 -04:00
2009-03-13 10:10:06 -04:00
ret = btrfs_run_delayed_refs ( trans , root , ( unsigned long ) - 1 ) ;
BUG_ON ( ret ) ;
2008-10-30 11:23:27 -04:00
2008-10-29 14:49:05 -04:00
eb = btrfs_lock_root_node ( fs_info - > tree_root ) ;
2009-03-13 10:24:59 -04:00
btrfs_cow_block ( trans , fs_info - > tree_root , eb , NULL , 0 , & eb ) ;
2008-10-29 14:49:05 -04:00
btrfs_tree_unlock ( eb ) ;
free_extent_buffer ( eb ) ;
2008-03-24 15:01:56 -04:00
2009-03-13 10:10:06 -04:00
ret = btrfs_run_delayed_refs ( trans , root , ( unsigned long ) - 1 ) ;
BUG_ON ( ret ) ;
2008-10-30 11:23:27 -04:00
2009-01-05 21:25:51 -05:00
while ( ! list_empty ( & fs_info - > dirty_cowonly_roots ) ) {
2008-03-24 15:01:56 -04:00
next = fs_info - > dirty_cowonly_roots . next ;
list_del_init ( next ) ;
root = list_entry ( next , struct btrfs_root , dirty_list ) ;
2008-10-30 11:23:27 -04:00
2008-03-24 15:01:56 -04:00
update_cowonly_root ( trans , root ) ;
2007-03-22 15:59:16 -04:00
}
2009-07-30 09:40:40 -04:00
down_write ( & fs_info - > extent_commit_sem ) ;
switch_commit_root ( fs_info - > extent_root ) ;
up_write ( & fs_info - > extent_commit_sem ) ;
2007-03-22 15:59:16 -04:00
return 0 ;
}
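/*
 * A user-space sketch of the drain loop above, assuming a minimal
 * circular doubly linked list in place of the kernel's list_head. All
 * sketch_* names are hypothetical. Popping the head until the list is
 * empty, rather than iterating over it, keeps the loop correct even if
 * processing one entry were to queue further entries.
 */

```c
#include <assert.h>
#include <stddef.h>

struct sketch_list {
	struct sketch_list *next, *prev;
};

static void sketch_list_init(struct sketch_list *h)
{
	h->next = h;
	h->prev = h;
}

static int sketch_list_empty(const struct sketch_list *h)
{
	return h->next == h;
}

static void sketch_list_add_tail(struct sketch_list *n, struct sketch_list *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Unlink an entry and leave it pointing at itself, i.e. "not queued". */
static void sketch_list_del_init(struct sketch_list *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	sketch_list_init(e);
}

struct sketch_dirty_root {
	int id;
	int committed;
	struct sketch_list dirty_list;
};

/* Recover the containing structure from a pointer to its list member. */
#define sketch_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Pop entries from the head until the list is empty; return the count. */
static int sketch_commit_all(struct sketch_list *dirty)
{
	int n = 0;

	assert(dirty != NULL);
	while (!sketch_list_empty(dirty)) {
		struct sketch_list *next = dirty->next;
		struct sketch_dirty_root *root;

		sketch_list_del_init(next);
		root = sketch_entry(next, struct sketch_dirty_root, dirty_list);
		root->committed = 1;	/* stand-in for updating the root */
		n++;
	}
	return n;
}
```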
2008-09-29 15:18:18 -04:00
/*
* dead roots are old snapshots that need to be deleted . This allocates
* a dirty root struct and adds it into the list of dead roots that need to
* be deleted
*/
2009-06-10 10:45:14 -04:00
int btrfs_add_dead_root ( struct btrfs_root * root )
2007-06-22 14:16:25 -04:00
{
2011-04-11 17:25:13 -04:00
spin_lock ( & root - > fs_info - > trans_lock ) ;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref records the pointer's key, its level, and the tree in which
the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
list_add ( & root - > root_list , & root - > fs_info - > dead_roots ) ;
2011-04-11 17:25:13 -04:00
spin_unlock ( & root - > fs_info - > trans_lock ) ;
2007-06-22 14:16:25 -04:00
return 0 ;
}
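The copy-on-write rule described in the commit message above can be sketched in user space. This is a minimal illustration, not the btrfs implementation: `struct blk`, `cow_block` and `MAX_PTRS` are hypothetical names invented here, and the blocks are plain in-memory structs rather than on-disk extents.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PTRS 4 /* hypothetical fan-out, chosen only for the sketch */

/* Hypothetical in-memory stand-in for a tree block. */
struct blk {
	int refs;                    /* number of references to this block */
	int freed;                   /* set once the block has been released */
	size_t nr;                   /* child pointers in use */
	struct blk *child[MAX_PTRS]; /* extents this block points to */
};

/*
 * Copy-on-write `old` into `new_blk`.
 *
 * Non-shared block (refs == 1): the old block is freed at once and the
 * new block takes over its references, so the children's reference
 * counts are left alone.
 *
 * Shared block (refs > 1): every extent the new block points to gains
 * one reference, and the old block loses one.
 */
static void cow_block(struct blk *old, struct blk *new_blk)
{
	size_t i;

	new_blk->nr = old->nr;
	for (i = 0; i < old->nr; i++)
		new_blk->child[i] = old->child[i];
	new_blk->refs = 1;
	new_blk->freed = 0;

	if (old->refs == 1) {
		old->refs = 0;
		old->freed = 1;
	} else {
		for (i = 0; i < new_blk->nr; i++)
			new_blk->child[i]->refs++;
		old->refs--;
	}
}
```

In the non-shared case no child counter is touched, which is the saving the commit message attributes to avoiding dead root records.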
2008-09-29 15:18:18 -04:00
/*
2009-06-10 10:45:14 -04:00
* update all the cowonly tree roots on disk
2008-09-29 15:18:18 -04:00
*/
2009-06-10 10:45:14 -04:00
static noinline int commit_fs_roots ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root )
2007-04-09 10:42:37 -04:00
{
struct btrfs_root * gang [ 8 ] ;
2009-06-10 10:45:14 -04:00
struct btrfs_fs_info * fs_info = root - > fs_info ;
2007-04-09 10:42:37 -04:00
int i ;
int ret ;
2007-06-22 14:16:25 -04:00
int err = 0 ;
2011-04-11 17:25:13 -04:00
spin_lock ( & fs_info - > fs_roots_radix_lock ) ;
2009-01-05 21:25:51 -05:00
while ( 1 ) {
2009-06-10 10:45:14 -04:00
ret = radix_tree_gang_lookup_tag ( & fs_info - > fs_roots_radix ,
( void * * ) gang , 0 ,
2007-04-09 10:42:37 -04:00
ARRAY_SIZE ( gang ) ,
BTRFS_ROOT_TRANS_TAG ) ;
if ( ret = = 0 )
break ;
for ( i = 0 ; i < ret ; i + + ) {
root = gang [ i ] ;
2009-06-10 10:45:14 -04:00
radix_tree_tag_clear ( & fs_info - > fs_roots_radix ,
( unsigned long ) root - > root_key . objectid ,
BTRFS_ROOT_TRANS_TAG ) ;
2011-04-11 17:25:13 -04:00
spin_unlock ( & fs_info - > fs_roots_radix_lock ) ;
2008-07-28 15:32:19 -04:00
2008-09-05 16:13:11 -04:00
btrfs_free_log ( trans , root ) ;
2009-06-10 10:45:14 -04:00
btrfs_update_reloc_root ( trans , root ) ;
2010-05-16 10:49:58 -04:00
btrfs_orphan_commit_root ( trans , root ) ;
2008-07-30 16:29:20 -04:00
2011-04-20 10:33:24 +08:00
btrfs_save_ino_cache ( root , trans ) ;
2011-11-14 20:48:06 -05:00
/* see comments in should_cow_block() */
root - > force_cow = 0 ;
smp_wmb ( ) ;
2009-06-15 20:01:02 -04:00
if ( root - > commit_root ! = root - > node ) {
Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) as the inode number when we create a file. Inode numbers
are therefore never reclaimed when we delete files, and we will run out of
them as we keep creating and deleting files on 32-bit machines.
This fixes it, and it works similarly to how we cache free space in block
groups.
We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.
Because we are searching the commit root, we have to carefully handle the
cross-transaction case.
The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K of RAM
for extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted at runtime.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-20 10:06:11 +08:00
mutex_lock ( & root - > fs_commit_mutex ) ;
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick off the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block group's caching waitqueue. The
caching kthread wakes the waiting threads every time it finds 2 MB
worth of space, and again when it has finished caching. This is how I tested
the speedup from this:
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change that's been put in place is that we lock the super mirrors in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time, so if one of the
async threads is still running as we cross transactions we can wait until it's
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
switch_commit_root ( root ) ;
2011-04-20 10:06:11 +08:00
btrfs_unpin_free_ino ( root ) ;
mutex_unlock ( & root - > fs_commit_mutex ) ;
2009-06-15 20:01:02 -04:00
btrfs_set_root_node ( & root - > root_item ,
root - > node ) ;
}
2009-06-10 10:45:14 -04:00
err = btrfs_update_root ( trans , fs_info - > tree_root ,
2007-04-09 10:42:37 -04:00
& root - > root_key ,
& root - > root_item ) ;
2011-04-11 17:25:13 -04:00
spin_lock ( & fs_info - > fs_roots_radix_lock ) ;
2007-06-22 14:16:25 -04:00
if ( err )
break ;
2007-04-09 10:42:37 -04:00
}
}
2011-04-11 17:25:13 -04:00
spin_unlock ( & fs_info - > fs_roots_radix_lock ) ;
2007-06-22 14:16:25 -04:00
return err ;
2007-04-09 10:42:37 -04:00
}
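The batch loop in commit_fs_roots() can be illustrated with a small user-space sketch: look up a bounded batch of tagged entries, clear each tag before processing it, and repeat until a lookup returns nothing. Everything here is a hypothetical stand-in: `struct toy_root`, `gang_lookup_tagged` and `commit_tagged` are invented names, a flat array replaces the radix tree, locking is omitted, and the sketch stops at the first error as a simplification.

```c
#include <assert.h>
#include <stddef.h>

#define NR_ROOTS 20 /* hypothetical number of roots */
#define BATCH 8     /* same batch size as the gang[] array above */

/* Hypothetical stand-in for a root tracked in the radix tree. */
struct toy_root {
	int tagged;    /* plays the role of BTRFS_ROOT_TRANS_TAG */
	int committed; /* set once the root has been processed */
};

/* Collect up to `max` tagged entries, scanning from index 0. */
static size_t gang_lookup_tagged(struct toy_root *roots, size_t n,
				 struct toy_root **out, size_t max)
{
	size_t found = 0;
	size_t i;

	for (i = 0; i < n && found < max; i++)
		if (roots[i].tagged)
			out[found++] = &roots[i];
	return found;
}

/* Placeholder for the per-root work; always succeeds in this sketch. */
static int commit_one(struct toy_root *r)
{
	r->committed = 1;
	return 0;
}

/* Process every tagged root in batches; *batches counts the lookups used. */
static int commit_tagged(struct toy_root *roots, size_t n, size_t *batches)
{
	struct toy_root *gang[BATCH];
	int err = 0;
	size_t ret, i;

	*batches = 0;
	while (!err) {
		ret = gang_lookup_tagged(roots, n, gang, BATCH);
		if (ret == 0)
			break;
		(*batches)++;
		for (i = 0; i < ret; i++) {
			/* Clear the tag first so the next lookup skips this root. */
			gang[i]->tagged = 0;
			err = commit_one(gang[i]);
			if (err)
				break;
		}
	}
	return err;
}
```

Because each tag is cleared before the root is processed, restarting the lookup from index 0 always makes progress and the loop ends once no tagged entry is left.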
2008-09-29 15:18:18 -04:00
/*
* defrag a given btree . If cacheonly = = 1 , this won ' t read from the disk ,
* otherwise every leaf in the btree is read and defragged .
*/
2007-08-10 14:06:19 -04:00
int btrfs_defrag_root ( struct btrfs_root * root , int cacheonly )
{
struct btrfs_fs_info * info = root - > fs_info ;
struct btrfs_trans_handle * trans ;
2010-05-16 10:49:58 -04:00
int ret ;
2007-09-17 10:58:06 -04:00
unsigned long nr ;
2007-08-10 14:06:19 -04:00
2010-05-16 10:49:58 -04:00
if ( xchg ( & root - > defrag_running , 1 ) )
2007-08-10 14:06:19 -04:00
return 0 ;
2010-05-16 10:49:58 -04:00
2007-10-15 16:17:34 -04:00
while ( 1 ) {
2010-05-16 10:49:58 -04:00
trans = btrfs_start_transaction ( root , 0 ) ;
if ( IS_ERR ( trans ) ) {
ret = PTR_ERR ( trans ) ;
break ;
}
2007-08-10 14:06:19 -04:00
ret = btrfs_defrag_leaves ( trans , root , cacheonly ) ;
2010-05-16 10:49:58 -04:00
2007-09-17 10:58:06 -04:00
nr = trans - > blocks_used ;
2007-08-10 14:06:19 -04:00
btrfs_end_transaction ( trans , root ) ;
2007-09-17 10:58:06 -04:00
btrfs_btree_balance_dirty ( info - > tree_root , nr ) ;
2007-08-10 14:06:19 -04:00
cond_resched ( ) ;
2011-05-31 18:07:27 +02:00
if ( btrfs_fs_closing ( root - > fs_info ) | | ret ! = - EAGAIN )
2007-08-10 14:06:19 -04:00
break ;
}
root - > defrag_running = 0 ;
2010-05-16 10:49:58 -04:00
return ret ;
2007-08-10 14:06:19 -04:00
}
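The xchg() guard at the top of btrfs_defrag_root() lets only one caller run the defrag loop at a time, and the loop repeats while a pass reports -EAGAIN. A rough user-space analogue, assuming C11 atomics in place of the kernel's xchg(), is sketched below; `struct toy_defrag`, `defrag_pass` and `defrag_run` are hypothetical names, and the pass itself is a stub.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the per-root defrag state. */
struct toy_defrag {
	atomic_int running; /* plays the role of root->defrag_running */
	int passes_left;    /* passes that will still report -EAGAIN */
	int passes_done;    /* total passes executed so far */
};

/* Stub for one defrag pass: asks to be called again until work runs out. */
static int defrag_pass(struct toy_defrag *d)
{
	d->passes_done++;
	if (d->passes_left > 0) {
		d->passes_left--;
		return -EAGAIN;
	}
	return 0;
}

static int defrag_run(struct toy_defrag *d)
{
	int ret;

	/* atomic_exchange returns the old value: nonzero means already running. */
	if (atomic_exchange(&d->running, 1))
		return 0;

	do {
		ret = defrag_pass(d);
	} while (ret == -EAGAIN);

	/* Always clear the flag on the way out so later callers can run. */
	atomic_store(&d->running, 0);
	return ret;
}
```

The exchange both tests and sets the flag in one step, so two callers racing on the same state cannot both enter the loop.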
2008-09-29 15:18:18 -04:00
/*
* new snapshots need to be created at a very specific time in the
* transaction commit . This does the actual creation
*/
2008-02-01 16:35:04 -05:00
static noinline int create_pending_snapshot ( struct btrfs_trans_handle * trans ,
2008-01-08 15:46:30 -05:00
struct btrfs_fs_info * fs_info ,
struct btrfs_pending_snapshot * pending )
{
struct btrfs_key key ;
2008-02-01 16:35:04 -05:00
struct btrfs_root_item * new_root_item ;
2008-01-08 15:46:30 -05:00
struct btrfs_root * tree_root = fs_info - > tree_root ;
struct btrfs_root * root = pending - > root ;
2010-03-15 17:27:13 +00:00
struct btrfs_root * parent_root ;
2011-09-11 10:52:24 -04:00
struct btrfs_block_rsv * rsv ;
2010-03-15 17:27:13 +00:00
struct inode * parent_inode ;
2010-11-20 09:48:00 +00:00
struct dentry * parent ;
2010-05-16 10:48:46 -04:00
struct dentry * dentry ;
2008-01-08 15:46:30 -05:00
struct extent_buffer * tmp ;
2008-06-25 16:01:30 -04:00
struct extent_buffer * old ;
2008-01-08 15:46:30 -05:00
int ret ;
2010-05-16 10:49:58 -04:00
u64 to_reserve = 0 ;
2010-03-15 17:27:13 +00:00
u64 index = 0 ;
2010-05-16 10:48:46 -04:00
u64 objectid ;
2010-12-20 16:04:08 +08:00
u64 root_flags ;
2008-01-08 15:46:30 -05:00
2011-09-11 10:52:24 -04:00
rsv = trans - > block_rsv ;
2008-02-01 16:35:04 -05:00
new_root_item = kmalloc ( sizeof ( * new_root_item ) , GFP_NOFS ) ;
if ( ! new_root_item ) {
2010-05-16 10:48:46 -04:00
pending - > error = - ENOMEM ;
2008-02-01 16:35:04 -05:00
goto fail ;
}
2010-05-16 10:48:46 -04:00
Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree and always
returns (highest+1) as the inode number when we create a file. Inode numbers
are therefore never reclaimed when we delete files, so we'll run out of them
as we keep creating and deleting files on 32-bit machines.
This fixes it, and it works similarly to how we cache free space in block
groups.
We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.
Because we are searching the commit root, we have to carefully handle the
cross-transaction case.
The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K of RAM
for extents, and bitmaps are used once we exceed this threshold. The
extents threshold is adjusted at runtime.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-20 10:06:11 +08:00
ret = btrfs_find_free_objectid ( tree_root , & objectid ) ;
2010-05-16 10:48:46 -04:00
if ( ret ) {
pending - > error = ret ;
2008-01-08 15:46:30 -05:00
goto fail ;
2010-05-16 10:48:46 -04:00
}
2008-01-08 15:46:30 -05:00
2010-05-16 10:49:59 -04:00
btrfs_reloc_pre_snapshot ( trans , pending , & to_reserve ) ;
2010-05-16 10:49:58 -04:00
if ( to_reserve > 0 ) {
2011-11-10 20:45:05 -05:00
ret = btrfs_block_rsv_add_noflush ( root , & pending - > block_rsv ,
to_reserve ) ;
2010-05-16 10:49:58 -04:00
if ( ret ) {
pending - > error = ret ;
goto fail ;
}
}
2008-01-08 15:46:30 -05:00
key . objectid = objectid ;
2010-05-16 10:48:46 -04:00
key . offset = ( u64 ) - 1 ;
key . type = BTRFS_ROOT_ITEM_KEY ;
2008-01-08 15:46:30 -05:00
2010-05-16 10:48:46 -04:00
trans - > block_rsv = & pending - > block_rsv ;
2008-11-17 21:02:50 -05:00
2010-05-16 10:48:46 -04:00
dentry = pending - > dentry ;
2010-11-20 09:48:00 +00:00
parent = dget_parent ( dentry ) ;
parent_inode = parent - > d_inode ;
2010-05-16 10:48:46 -04:00
parent_root = BTRFS_I ( parent_inode ) - > root ;
2011-06-13 20:00:16 -04:00
record_root_in_trans ( trans , parent_root ) ;
2010-05-16 10:48:46 -04:00
2008-01-08 15:46:30 -05:00
/*
* insert the directory item
*/
2008-11-17 21:02:50 -05:00
ret = btrfs_set_inode_index ( parent_inode , & index ) ;
2010-03-15 17:27:13 +00:00
BUG_ON ( ret ) ;
2008-11-17 20:37:39 -05:00
ret = btrfs_insert_dir_item ( trans , parent_root ,
2010-05-16 10:48:46 -04:00
dentry - > d_name . name , dentry - > d_name . len ,
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as the inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so this patch implements delayed directory name index
insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, which uses two lists to
manage the delayed nodes that are created for every file/directory.
One list holds all the delayed nodes that have delayed items. The
other holds the delayed nodes that are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees: one manages the directory name
index items that are going to be inserted into the b+ tree, and the other
manages the directory name index items that are going to be deleted from it.
- introduce a worker to deal with the delayed operations: the delayed
directory name index item insertions and deletions and the delayed inode
updates.
When the number of delayed items exceeds the lower limit, we create works
for some delayed nodes, insert them into the work queue of the worker, and
then go back.
When the number of delayed items exceeds the upper bound, we create works
for all the delayed nodes that haven't been dealt with, insert them into the
work queue of the worker, and then wait until the number of untreated items
drops below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add
the information into the delayed inserting rb-tree.
Then we check the number of delayed items and do the delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we first
search for it in the inserting rb-tree. If we find it, we just drop it. If
not, we add its key into the delayed deleting rb-tree.
As with the delayed inserting rb-tree, we also check the number of
delayed items and do the delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertions and deletions.
- We move the delayed node to the tail of the list after we access it.
This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit the transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
parent_inode , & key ,
2010-05-16 10:48:46 -04:00
BTRFS_FT_DIR , index ) ;
2012-02-20 08:40:56 -05:00
if ( ret ) {
pending - > error = - EEXIST ;
dput ( parent ) ;
goto fail ;
}
2008-11-17 20:37:39 -05:00
2010-05-16 10:48:46 -04:00
btrfs_i_size_write ( parent_inode , parent_inode - > i_size +
dentry - > d_name . len * 2 ) ;
2009-01-05 15:43:43 -05:00
ret = btrfs_update_inode ( trans , parent_root , parent_inode ) ;
BUG_ON ( ret ) ;
2011-06-17 16:14:09 -04:00
/*
* pull in the delayed directory update
* and the delayed inode item
* otherwise we corrupt the FS during
* snapshot
*/
ret = btrfs_run_delayed_items ( trans , root ) ;
BUG_ON ( ret ) ;
2011-06-13 20:00:16 -04:00
record_root_in_trans ( trans , root ) ;
2010-03-15 17:27:13 +00:00
btrfs_set_root_last_snapshot ( & root - > root_item , trans - > transid ) ;
memcpy ( new_root_item , & root - > root_item , sizeof ( * new_root_item ) ) ;
2011-03-28 02:01:25 +00:00
btrfs_check_and_init_root_item ( new_root_item ) ;
2010-03-15 17:27:13 +00:00
2010-12-20 16:04:08 +08:00
root_flags = btrfs_root_flags ( new_root_item ) ;
if ( pending - > readonly )
root_flags | = BTRFS_ROOT_SUBVOL_RDONLY ;
else
root_flags & = ~ BTRFS_ROOT_SUBVOL_RDONLY ;
btrfs_set_root_flags ( new_root_item , root_flags ) ;
2010-03-15 17:27:13 +00:00
old = btrfs_lock_root_node ( root ) ;
btrfs_cow_block ( trans , root , old , NULL , 0 , & old ) ;
btrfs_set_lock_blocking ( old ) ;
btrfs_copy_root ( trans , root , old , & tmp , objectid ) ;
btrfs_tree_unlock ( old ) ;
free_extent_buffer ( old ) ;
2011-11-14 20:48:06 -05:00
/* see comments in should_cow_block() */
root - > force_cow = 1 ;
smp_wmb ( ) ;
2010-03-15 17:27:13 +00:00
btrfs_set_root_node ( new_root_item , tmp ) ;
2010-05-16 10:48:46 -04:00
/* record when the snapshot was created in key.offset */
key . offset = trans - > transid ;
ret = btrfs_insert_root ( trans , tree_root , & key , new_root_item ) ;
2010-03-15 17:27:13 +00:00
btrfs_tree_unlock ( tmp ) ;
free_extent_buffer ( tmp ) ;
2010-05-16 10:48:46 -04:00
BUG_ON ( ret ) ;
2010-03-15 17:27:13 +00:00
2010-05-16 10:48:46 -04:00
/*
* insert root back / forward references
*/
ret = btrfs_add_root_ref ( trans , tree_root , objectid ,
2008-11-17 20:37:39 -05:00
parent_root - > root_key . objectid ,
2011-04-20 10:31:50 +08:00
btrfs_ino ( parent_inode ) , index ,
2010-05-16 10:48:46 -04:00
dentry - > d_name . name , dentry - > d_name . len ) ;
2008-11-17 20:37:39 -05:00
BUG_ON ( ret ) ;
2010-11-20 09:48:00 +00:00
dput ( parent ) ;
2008-11-17 20:37:39 -05:00
2010-05-16 10:48:46 -04:00
key . offset = ( u64 ) - 1 ;
pending - > snap = btrfs_read_fs_root_no_name ( root - > fs_info , & key ) ;
BUG_ON ( IS_ERR ( pending - > snap ) ) ;
2010-05-16 10:49:58 -04:00
2010-05-16 10:49:59 -04:00
btrfs_reloc_post_snapshot ( trans , pending ) ;
2008-01-08 15:46:30 -05:00
fail :
2010-03-15 17:27:13 +00:00
kfree ( new_root_item ) ;
2011-09-11 10:52:24 -04:00
trans - > block_rsv = rsv ;
2010-05-16 10:48:46 -04:00
btrfs_block_rsv_release ( root , & pending - > block_rsv , ( u64 ) - 1 ) ;
return 0 ;
2008-01-08 15:46:30 -05:00
}
2008-09-29 15:18:18 -04:00
/*
* create all the snapshots we've scheduled for creation
*/
2008-02-01 16:35:04 -05:00
static noinline int create_pending_snapshots ( struct btrfs_trans_handle * trans ,
struct btrfs_fs_info * fs_info )
2008-11-17 21:02:50 -05:00
{
struct btrfs_pending_snapshot * pending ;
struct list_head * head = & trans - > transaction - > pending_snapshots ;
2012-02-20 08:40:56 -05:00
list_for_each_entry ( pending , head , list )
create_pending_snapshot ( trans , fs_info , pending ) ;
2008-11-17 21:02:50 -05:00
return 0 ;
}
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
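The copy-on-write reference counting rule described in the commit message above can be sketched in a few lines. This is a simplified userspace model, not the btrfs implementation: struct demo_block, DEMO_MAX_CHILDREN and demo_cow are hypothetical names, and real tree blocks track back references rather than a bare counter.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_CHILDREN 4

/* Hypothetical, heavily simplified tree block. */
struct demo_block {
	int refs;	/* how many parents or roots point at this block */
	int freed;	/* set once the block has been released */
	size_t nr_children;
	struct demo_block *children[DEMO_MAX_CHILDREN];
};

/*
 * Copy old into new_blk following the rule from the commit message:
 *  - non-shared block (refs == 1): free the old block at once; the new
 *    block inherits its references, so child counts stay untouched.
 *  - shared block (refs > 1): the old block survives, so every child
 *    gains one reference and the old block loses one.
 */
void demo_cow(struct demo_block *old, struct demo_block *new_blk)
{
	size_t i;

	assert(old->refs >= 1 && old->nr_children <= DEMO_MAX_CHILDREN);

	new_blk->refs = 1;
	new_blk->freed = 0;
	new_blk->nr_children = old->nr_children;
	for (i = 0; i < old->nr_children; i++)
		new_blk->children[i] = old->children[i];

	if (old->refs == 1) {
		old->refs = 0;
		old->freed = 1;
	} else {
		for (i = 0; i < old->nr_children; i++)
			old->children[i]->refs++;
		old->refs--;
	}
}
```

In the non-shared case no child counter changes at all, which is the saving the commit message describes: the increments during COW and the later decrements from the dead root walk are no longer needed.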
2009-06-10 10:45:14 -04:00
static void update_super_roots ( struct btrfs_root * root )
{
struct btrfs_root_item * root_item ;
struct btrfs_super_block * super ;
2011-04-13 15:41:04 +02:00
super = root - > fs_info - > super_copy ;
2009-06-10 10:45:14 -04:00
root_item = & root - > fs_info - > chunk_root - > root_item ;
super - > chunk_root = root_item - > bytenr ;
super - > chunk_root_generation = root_item - > generation ;
super - > chunk_root_level = root_item - > level ;
root_item = & root - > fs_info - > tree_root - > root_item ;
super - > root = root_item - > bytenr ;
super - > generation = root_item - > generation ;
super - > root_level = root_item - > level ;
2011-10-03 14:07:49 -04:00
if ( btrfs_test_opt ( root , SPACE_CACHE ) )
2010-06-21 14:48:16 -04:00
super - > cache_generation = root_item - > generation ;
2009-06-10 10:45:14 -04:00
}
2009-07-30 10:04:48 -04:00
int btrfs_transaction_in_commit ( struct btrfs_fs_info * info )
{
int ret = 0 ;
2011-04-11 17:25:13 -04:00
spin_lock ( & info - > trans_lock ) ;
2009-07-30 10:04:48 -04:00
if ( info - > running_transaction )
ret = info - > running_transaction - > in_commit ;
2011-04-11 17:25:13 -04:00
spin_unlock ( & info - > trans_lock ) ;
2009-07-30 10:04:48 -04:00
return ret ;
}
2010-05-16 10:49:58 -04:00
int btrfs_transaction_blocked ( struct btrfs_fs_info * info )
{
int ret = 0 ;
2011-04-11 17:25:13 -04:00
spin_lock ( & info - > trans_lock ) ;
2010-05-16 10:49:58 -04:00
if ( info - > running_transaction )
ret = info - > running_transaction - > blocked ;
2011-04-11 17:25:13 -04:00
spin_unlock ( & info - > trans_lock ) ;
2010-05-16 10:49:58 -04:00
return ret ;
}
2010-10-29 15:37:34 -04:00
/*
* wait for the current transaction commit to start and block subsequent
* transaction joins
*/
static void wait_current_trans_commit_start ( struct btrfs_root * root ,
struct btrfs_transaction * trans )
{
2011-07-14 03:17:00 +00:00
wait_event ( root - > fs_info - > transaction_blocked_wait , trans - > in_commit ) ;
2010-10-29 15:37:34 -04:00
}
/*
* wait for the current transaction commit to start and then become unblocked .
* caller holds ref .
*/
static void wait_current_trans_commit_start_and_unblock ( struct btrfs_root * root ,
struct btrfs_transaction * trans )
{
2011-07-14 03:17:00 +00:00
wait_event ( root - > fs_info - > transaction_wait ,
trans - > commit_done | | ( trans - > in_commit & & ! trans - > blocked ) ) ;
2010-10-29 15:37:34 -04:00
}
/*
* commit transactions asynchronously . once btrfs_commit_transaction_async
* returns , any subsequent transaction will not be allowed to join .
*/
struct btrfs_async_commit {
struct btrfs_trans_handle * newtrans ;
struct btrfs_root * root ;
struct delayed_work work ;
} ;
static void do_async_commit ( struct work_struct * work )
{
struct btrfs_async_commit * ac =
container_of ( work , struct btrfs_async_commit , work . work ) ;
btrfs_commit_transaction ( ac - > newtrans , ac - > root ) ;
kfree ( ac ) ;
}
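do_async_commit receives only the embedded work item and recovers the enclosing btrfs_async_commit with container_of. A userspace sketch of that pointer arithmetic follows; the demo_ types are hypothetical stand-ins for the kernel structures, and demo_container_of omits the type checking that the kernel macro performs.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace approximation of the kernel's container_of(). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for work_struct and delayed_work. */
struct demo_work {
	int pending;
};

struct demo_delayed_work {
	struct demo_work work;
	long delay;
};

/* Hypothetical stand-in for struct btrfs_async_commit. */
struct demo_async_commit {
	int newtrans_id;
	struct demo_delayed_work work;
};

/*
 * Map a pointer to the innermost work item back to its container,
 * the same way do_async_commit uses the nested member work.work.
 */
struct demo_async_commit *demo_commit_from_work(struct demo_work *work)
{
	assert(work != NULL);
	return demo_container_of(work, struct demo_async_commit, work.work);
}
```

Because the work item lives inside the allocation it describes, the callback needs no separate lookup table, and freeing the container (as kfree(ac) does above) releases the work item with it.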
int btrfs_commit_transaction_async ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
int wait_for_unblock )
{
struct btrfs_async_commit * ac ;
struct btrfs_transaction * cur_trans ;
ac = kmalloc ( sizeof ( * ac ) , GFP_NOFS ) ;
2011-03-23 08:14:16 +00:00
if ( ! ac )
return - ENOMEM ;
2010-10-29 15:37:34 -04:00
INIT_DELAYED_WORK ( & ac - > work , do_async_commit ) ;
ac - > root = root ;
2011-04-13 12:54:33 -04:00
ac - > newtrans = btrfs_join_transaction ( root ) ;
2011-01-25 02:51:38 +00:00
if ( IS_ERR ( ac - > newtrans ) ) {
int err = PTR_ERR ( ac - > newtrans ) ;
kfree ( ac ) ;
return err ;
}
2010-10-29 15:37:34 -04:00
/* take transaction reference */
cur_trans = trans - > transaction ;
2011-04-11 15:45:29 -04:00
atomic_inc ( & cur_trans - > use_count ) ;
2010-10-29 15:37:34 -04:00
btrfs_end_transaction ( trans , root ) ;
schedule_delayed_work ( & ac - > work , 0 ) ;
/* wait for transaction to start and unblock */
if ( wait_for_unblock )
wait_current_trans_commit_start_and_unblock ( root , cur_trans ) ;
else
wait_current_trans_commit_start ( root , cur_trans ) ;
2011-06-10 18:43:13 +00:00
if ( current - > journal_info = = trans )
current - > journal_info = NULL ;
put_transaction ( cur_trans ) ;
2010-10-29 15:37:34 -04:00
return 0 ;
}
/*
* btrfs_transaction state sequence :
* in_commit = 0 , blocked = 0 ( initial )
* in_commit = 1 , blocked = 1
* blocked = 0
* commit_done = 1
*/
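The state sequence in the comment above, together with the conditions used by the two wait helpers, can be modelled as plain predicates. This is an illustrative sketch only: struct demo_trans, enum demo_stage and the demo_ functions are hypothetical, and the real fields are updated under locks and observed through wait queues.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical copy of the three flags named in the comment. */
struct demo_trans {
	bool in_commit;
	bool blocked;
	bool commit_done;
};

enum demo_stage {
	DEMO_RUNNING,		/* in_commit = 0, blocked = 0 (initial) */
	DEMO_COMMIT_BLOCKED,	/* in_commit = 1, blocked = 1 */
	DEMO_COMMIT_UNBLOCKED,	/* blocked = 0 again */
	DEMO_DONE		/* commit_done = 1 */
};

/* Classify a transaction by where it is in the documented sequence. */
enum demo_stage demo_stage_of(const struct demo_trans *t)
{
	assert(t != NULL);
	if (t->commit_done)
		return DEMO_DONE;
	if (t->in_commit)
		return t->blocked ? DEMO_COMMIT_BLOCKED : DEMO_COMMIT_UNBLOCKED;
	return DEMO_RUNNING;
}

/* Condition waited on by wait_current_trans_commit_start(). */
bool demo_commit_started(const struct demo_trans *t)
{
	return t->in_commit;
}

/* Condition waited on by wait_current_trans_commit_start_and_unblock(). */
bool demo_started_and_unblocked(const struct demo_trans *t)
{
	return t->commit_done || (t->in_commit && !t->blocked);
}
```

The second predicate also accepts commit_done, so a waiter that arrives after the commit has fully finished does not sleep forever.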
2007-03-22 15:59:16 -04:00
int btrfs_commit_transaction ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root )
{
2007-08-10 16:22:09 -04:00
unsigned long joined = 0 ;
2007-03-22 15:59:16 -04:00
struct btrfs_transaction * cur_trans ;
2007-04-19 21:01:03 -04:00
struct btrfs_transaction * prev_trans = NULL ;
2007-03-22 15:59:16 -04:00
DEFINE_WAIT ( wait ) ;
2007-08-10 16:22:09 -04:00
int ret ;
2009-03-12 20:12:45 -04:00
int should_grow = 0 ;
unsigned long now = get_seconds ( ) ;
2009-04-02 16:59:01 -04:00
int flush_on_commit = btrfs_test_opt ( root , FLUSHONCOMMIT ) ;
2007-03-22 15:59:16 -04:00
2009-03-31 13:27:11 -04:00
btrfs_run_ordered_operations ( root , 0 ) ;
2011-10-14 14:40:17 -04:00
btrfs_trans_release_metadata ( trans , root ) ;
2011-09-19 11:58:54 -04:00
trans - > block_rsv = NULL ;
2009-03-13 10:10:06 -04:00
/* make a pass through all the delayed refs we have so far
* any running procs may add more while we are here
*/
ret = btrfs_run_delayed_refs ( trans , root , 0 ) ;
BUG_ON ( ret ) ;
2009-03-12 20:12:45 -04:00
cur_trans = trans - > transaction ;
2009-03-13 10:10:06 -04:00
/*
* set the flushing flag so procs in this transaction have to
* start sending their work down .
*/
2009-03-12 20:12:45 -04:00
cur_trans - > delayed_refs . flushing = 1 ;
2009-03-13 10:10:06 -04:00
2009-03-13 10:17:05 -04:00
ret = btrfs_run_delayed_refs ( trans , root , 0 ) ;
2009-03-13 10:10:06 -04:00
BUG_ON ( ret ) ;
2011-04-11 17:25:13 -04:00
spin_lock ( & cur_trans - > commit_lock ) ;
2009-03-12 20:12:45 -04:00
if ( cur_trans - > in_commit ) {
2011-04-11 17:25:13 -04:00
spin_unlock ( & cur_trans - > commit_lock ) ;
2011-04-11 15:45:29 -04:00
atomic_inc ( & cur_trans - > use_count ) ;
2007-03-22 15:59:16 -04:00
btrfs_end_transaction ( trans , root ) ;
2007-06-28 15:57:36 -04:00
2011-07-14 03:17:14 +00:00
wait_for_commit ( root , cur_trans ) ;
2007-08-10 16:22:09 -04:00
2007-03-22 15:59:16 -04:00
put_transaction ( cur_trans ) ;
2007-08-10 16:22:09 -04:00
2007-03-22 15:59:16 -04:00
return 0 ;
}
2008-01-03 09:08:48 -05:00
2007-04-02 10:50:19 -04:00
trans - > transaction - > in_commit = 1 ;
2008-07-17 12:54:14 -04:00
trans - > transaction - > blocked = 1 ;
2011-04-11 17:25:13 -04:00
spin_unlock ( & cur_trans - > commit_lock ) ;
2010-10-29 15:37:34 -04:00
wake_up ( & root - > fs_info - > transaction_blocked_wait ) ;
2011-04-11 17:25:13 -04:00
spin_lock ( & root - > fs_info - > trans_lock ) ;
2007-06-28 15:57:36 -04:00
if ( cur_trans - > list . prev ! = & root - > fs_info - > trans_list ) {
prev_trans = list_entry ( cur_trans - > list . prev ,
struct btrfs_transaction , list ) ;
if ( ! prev_trans - > commit_done ) {
2011-04-11 15:45:29 -04:00
atomic_inc ( & prev_trans - > use_count ) ;
2011-04-11 17:25:13 -04:00
spin_unlock ( & root - > fs_info - > trans_lock ) ;
2007-06-28 15:57:36 -04:00
wait_for_commit ( root , prev_trans ) ;
2007-08-10 16:22:09 -04:00
put_transaction ( prev_trans ) ;
2011-04-11 17:25:13 -04:00
} else {
spin_unlock ( & root - > fs_info - > trans_lock ) ;
2007-06-28 15:57:36 -04:00
}
2011-04-11 17:25:13 -04:00
} else {
spin_unlock ( & root - > fs_info - > trans_lock ) ;
2007-06-28 15:57:36 -04:00
}
2007-08-10 16:22:09 -04:00
2009-03-12 20:12:45 -04:00
if ( now < cur_trans - > start_time | | now - cur_trans - > start_time < 1 )
should_grow = 1 ;
2007-08-10 16:22:09 -04:00
do {
2008-08-05 13:05:02 -04:00
int snap_pending = 0 ;
2011-04-11 17:25:13 -04:00
2007-08-10 16:22:09 -04:00
joined = cur_trans - > num_joined ;
2008-08-05 13:05:02 -04:00
if ( ! list_empty ( & trans - > transaction - > pending_snapshots ) )
snap_pending = 1 ;
2007-04-02 10:50:19 -04:00
WARN_ON ( cur_trans ! = trans - > transaction ) ;
2007-08-10 16:22:09 -04:00
2010-02-19 14:13:50 -08:00
if ( flush_on_commit | | snap_pending ) {
2009-11-12 09:36:34 +00:00
btrfs_start_delalloc_inodes ( root , 1 ) ;
ret = btrfs_wait_ordered_extents ( root , 0 , 1 ) ;
2009-07-24 13:17:44 -04:00
BUG_ON ( ret ) ;
2008-08-05 13:05:02 -04:00
}
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which is waiting to be dealt with
by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the delayed items is beyond the lower limit, we create works for some
delayed nodes and insert them into the work queue of the worker, and then
go back.
When the delayed items is beyond the upper bound, we create works for all
the delayed nodes that haven't been dealt with, and insert them into the work
queue of the worker, and then wait for that the untreated items is below some
threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search for
it in the inserting rb-tree first. If we find it, just drop it. If not,
add its key into the delayed deleting rb-tree.
As with the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same as for the insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit a transaction, we will deal with all the delayed nodes.
- The delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
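The two-threshold balance policy described in the list above can be sketched
as follows. This is only an illustration: the limit values, the enum and the
function name are hypothetical and do not come from the patch, which names no
concrete thresholds here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limits; the commit message gives no concrete numbers. */
#define DELAYED_LOWER_LIMIT 64u
#define DELAYED_UPPER_LIMIT 256u

enum balance_action {
	BALANCE_NONE,       /* at or below the lower limit: keep caching items */
	BALANCE_ASYNC_SOME, /* queue work for some delayed nodes, return at once */
	BALANCE_SYNC_ALL    /* queue work for all delayed nodes, then wait */
};

/* Decide what the caller should do for a given number of delayed items. */
enum balance_action delayed_balance_action(size_t nr_items)
{
	assert(DELAYED_LOWER_LIMIT < DELAYED_UPPER_LIMIT);

	if (nr_items > DELAYED_UPPER_LIMIT)
		return BALANCE_SYNC_ALL;
	if (nr_items > DELAYED_LOWER_LIMIT)
		return BALANCE_ASYNC_SOME;
	return BALANCE_NONE;
}
```

With these assumed limits, a caller holding 100 delayed items would queue some
work and return, while a caller holding 300 would queue everything and wait.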
I did a quick test with the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
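The quoted improvements can be checked against the total times listed above.
A minimal sketch, assuming "improvement" means the relative reduction in total
time; the helper name is made up for this illustration.

```c
#include <assert.h>

/* Relative reduction in total time, as a percentage of the "before" time. */
double improvement_percent(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

Under that assumption, the create times (1.096108 -> 0.932899) give roughly
14.9% and the delete times (1.510403 -> 1.215732) give roughly 19.5%, which
matches the ~15% and ~20% figures.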
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
ret = btrfs_run_delayed_items ( trans , root ) ;
BUG_ON ( ret ) ;
2009-03-31 13:27:11 -04:00
/*
* rename doesn ' t use btrfs_join_transaction , so , once we
* set the transaction to blocked above , we aren ' t going
* to get any new ordered operations . We can safely run
* it here and know for sure that nothing new will be added
* to the list
*/
btrfs_run_ordered_operations ( root , 1 ) ;
2010-05-25 10:12:41 -04:00
prepare_to_wait ( & cur_trans - > writer_wait , & wait ,
TASK_UNINTERRUPTIBLE ) ;
2011-04-11 15:45:29 -04:00
if ( atomic_read ( & cur_trans - > num_writers ) > 1 )
2010-10-29 15:37:34 -04:00
schedule_timeout ( MAX_SCHEDULE_TIMEOUT ) ;
else if ( should_grow )
schedule_timeout ( 1 ) ;
2007-08-10 16:22:09 -04:00
finish_wait ( & cur_trans - > writer_wait , & wait ) ;
2011-04-11 15:45:29 -04:00
} while ( atomic_read ( & cur_trans - > num_writers ) > 1 | |
2009-03-12 20:12:45 -04:00
( should_grow & & cur_trans - > num_joined ! = joined ) ) ;
2007-08-10 16:22:09 -04:00
2011-06-14 16:22:15 -04:00
/*
* Ok now we need to make sure to block out any other joins while we
* commit the transaction . We could have started a join before setting
* no_join so make sure to wait for num_writers to = = 1 again .
*/
spin_lock ( & root - > fs_info - > trans_lock ) ;
root - > fs_info - > trans_no_join = 1 ;
spin_unlock ( & root - > fs_info - > trans_lock ) ;
wait_event ( cur_trans - > writer_wait ,
atomic_read ( & cur_trans - > num_writers ) = = 1 ) ;
2011-06-13 20:00:16 -04:00
/*
* the reloc mutex makes sure that we stop
* the balancing code from coming in and moving
* extents around in the middle of the commit
*/
mutex_lock ( & root - > fs_info - > reloc_mutex ) ;
2011-06-17 16:14:09 -04:00
ret = btrfs_run_delayed_items ( trans , root ) ;
2008-01-08 15:46:30 -05:00
BUG_ON ( ret ) ;
2011-06-17 16:14:09 -04:00
ret = create_pending_snapshots ( trans , root - > fs_info ) ;
2011-04-22 18:12:22 +08:00
BUG_ON ( ret ) ;
2009-03-13 10:10:06 -04:00
ret = btrfs_run_delayed_refs ( trans , root , ( unsigned long ) - 1 ) ;
BUG_ON ( ret ) ;
2011-06-17 16:14:09 -04:00
/*
* make sure none of the code above managed to slip in a
* delayed item
*/
btrfs_assert_delayed_root_empty ( root ) ;
2007-04-02 10:50:19 -04:00
WARN_ON ( cur_trans ! = trans - > transaction ) ;
2008-01-08 15:46:30 -05:00
2011-03-08 14:14:00 +01:00
btrfs_scrub_pause ( root ) ;
2008-09-05 16:13:11 -04:00
/* btrfs_commit_tree_roots is responsible for getting the
* various roots consistent with each other . Every pointer
* in the tree of tree roots has to point to the most up to date
* root for every subvolume and other tree . So , we have to keep
* the tree logging code from jumping in and changing any
* of the trees .
*
* At this point in the commit , there can ' t be any tree - log
* writers , but a little lower down we drop the trans mutex
* and let new people in . By holding the tree_log_mutex
* from now until after the super is written , we avoid races
* with the tree - log code .
*/
mutex_lock ( & root - > fs_info - > tree_log_mutex ) ;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
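The reference count rule in the paragraph above can be sketched as below. The
structures and the function are simplified stand-ins invented for this
illustration, not the btrfs data structures.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical, simplified stand-ins for an extent and a tree block. */
struct extent {
	int refs;
};

struct tree_block {
	int refs;
	size_t nr_children;
	struct extent *children[MAX_CHILDREN];
};

/*
 * Apply the reference count changes for cow'ing "old".
 * Returns 1 if the old block is freed, 0 if it stays alive.
 */
int cow_update_refs(struct tree_block *old)
{
	size_t i;

	assert(old->refs >= 1);

	if (old->refs == 1) {
		/* Non-shared: the new block inherits the old block's
		 * references, so the extents it points to are untouched. */
		old->refs = 0;
		return 1;
	}

	/* Shared: the new block adds one reference to every extent it
	 * points to, and the old block loses one reference. */
	for (i = 0; i < old->nr_children; i++)
		old->children[i]->refs++;
	old->refs--;
	return 0;
}
```

The point of the sketch is that the non-shared case never touches the lower
level extents, which is where the saving described in the message comes from.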
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
ret = commit_fs_roots ( trans , root ) ;
2007-06-22 14:16:25 -04:00
BUG_ON ( ret ) ;
2009-06-10 10:45:14 -04:00
/* commit_fs_roots gets rid of all the tree log roots, it is now
2008-09-05 16:13:11 -04:00
* safe to free the root of tree log roots
*/
btrfs_free_log_root_tree ( trans , root - > fs_info ) ;
2009-06-10 10:45:14 -04:00
ret = commit_cowonly_roots ( trans , root ) ;
2007-03-22 15:59:16 -04:00
BUG_ON ( ret ) ;
2007-06-22 14:16:25 -04:00
2009-09-11 16:11:19 -04:00
btrfs_prepare_extent_commit ( trans , root ) ;
2007-03-25 11:35:08 -04:00
cur_trans = root - > fs_info - > running_transaction ;
2009-06-10 10:45:14 -04:00
btrfs_set_root_node ( & root - > fs_info - > tree_root - > root_item ,
root - > fs_info - > tree_root - > node ) ;
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick off the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block group's caching waitqueue; the
caching kthread wakes the waiting threads up every time it finds 2 meg
worth of space, and then again when it's finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change that's been put in place is we lock the super mirrors in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
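The "fully used block group" shortcut from the paragraph above can be sketched
like this. The struct, the enum and the function name are hypothetical and
only illustrate the check; they are not the names used in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cache states for this illustration. */
enum cache_state {
	CACHE_NO,       /* free space still has to be scanned */
	CACHE_FINISHED  /* nothing left to scan */
};

struct block_group {
	uint64_t size;  /* total bytes in the block group */
	uint64_t used;  /* bytes already allocated */
	enum cache_state cached;
};

/*
 * When a block group is read from disk: if it is completely full there is
 * no free space to find, so it can be marked as cached straight away.
 */
void init_cache_state(struct block_group *bg)
{
	assert(bg->used <= bg->size);
	bg->cached = (bg->used == bg->size) ? CACHE_FINISHED : CACHE_NO;
}
```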
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until it's
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
switch_commit_root ( root - > fs_info - > tree_root ) ;
2009-06-10 10:45:14 -04:00
btrfs_set_root_node ( & root - > fs_info - > chunk_root - > root_item ,
root - > fs_info - > chunk_root - > node ) ;
2009-07-13 21:29:25 -04:00
switch_commit_root ( root - > fs_info - > chunk_root ) ;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
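A minimal sketch of that selection rule, assuming a simplified in-memory representation: all `toy_` names are hypothetical, the key is reduced to a single integer, and the real back ref items carry more fields than shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the two kinds of back ref described above. */
enum toy_backref_kind { TOY_BACKREF_KEYED, TOY_BACKREF_FULL };

struct toy_backref {
	enum toy_backref_kind kind;
	union {
		struct {		/* "fuzzy": resolved by searching the tree */
			uint64_t root_id;	/* tree the pointer lives in */
			uint64_t key;		/* pointer's key, simplified */
			int level;		/* pointer's level */
		} keyed;
		uint64_t parent_bytenr;	/* full: location of the referencing block */
	} u;
};

/*
 * Use the keyed (fuzzy) back ref when only one root references the
 * block, and the full back ref when several roots do.
 */
static struct toy_backref toy_make_backref(unsigned int nr_roots,
					   uint64_t root_id, uint64_t key,
					   int level, uint64_t parent_bytenr)
{
	struct toy_backref ref;

	assert(nr_roots >= 1);

	if (nr_roots == 1) {
		ref.kind = TOY_BACKREF_KEYED;
		ref.u.keyed.root_id = root_id;
		ref.u.keyed.key = key;
		ref.u.keyed.level = level;
	} else {
		ref.kind = TOY_BACKREF_FULL;
		ref.u.parent_bytenr = parent_bytenr;
	}
	return ref;
}
```

The keyed form does not need rewriting when the referencing block moves, while the full form avoids the O(number_of_snapshots) search for shared blocks.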
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
	update_super_roots(root);
2008-09-05 16:13:11 -04:00
	if (!root->fs_info->log_root_recovering) {
2011-04-13 15:41:04 +02:00
		btrfs_set_super_log_root(root->fs_info->super_copy, 0);
		btrfs_set_super_log_root_level(root->fs_info->super_copy, 0);
2008-09-05 16:13:11 -04:00
	}
2011-04-13 15:41:04 +02:00
	memcpy(root->fs_info->super_for_commit, root->fs_info->super_copy,
	       sizeof(*root->fs_info->super_copy));
2007-06-28 15:57:36 -04:00
2008-07-17 12:54:14 -04:00
	trans->transaction->blocked = 0;
2011-04-11 17:25:13 -04:00
	spin_lock(&root->fs_info->trans_lock);
	root->fs_info->running_transaction = NULL;
	root->fs_info->trans_no_join = 0;
	spin_unlock(&root->fs_info->trans_lock);
2011-06-13 20:00:16 -04:00
	mutex_unlock(&root->fs_info->reloc_mutex);
2009-03-12 20:12:45 -04:00
2008-07-17 12:54:14 -04:00
	wake_up(&root->fs_info->transaction_wait);
2008-07-17 12:53:50 -04:00
2007-03-22 15:59:16 -04:00
	ret = btrfs_write_and_wait_transaction(trans, root);
	BUG_ON(ret);
2008-12-08 16:46:26 -05:00
	write_ctree_super(trans, root, 0);
2008-01-03 09:08:48 -05:00
2008-09-05 16:13:11 -04:00
	/*
	 * the super is written, we can safely allow the tree-loggers
	 * to go about their business
	 */
	mutex_unlock(&root->fs_info->tree_log_mutex);
2009-09-11 16:11:19 -04:00
	btrfs_finish_extent_commit(trans, root);
2008-01-03 09:08:48 -05:00
2007-04-02 10:50:19 -04:00
	cur_trans->commit_done = 1;
2009-03-12 20:12:45 -04:00
2007-08-10 16:22:09 -04:00
	root->fs_info->last_trans_committed = cur_trans->transid;
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking behind the caching
mutex, we kick off the caching kthread and then attempt to make an
allocation. If we cannot, we wait on the block group's caching waitqueue; the
caching kthread wakes the waiting threads every time it finds 2 meg
worth of space, and then again when it's finished caching. This is how I tested
the speedup from this:
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box; with these changes it now
takes 1 second.
Another change that's been put in place is that we lock the super mirrors in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
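The "fully used means already cached" check can be sketched as follows. This is a user-space illustration only: `toy_block_group` and `toy_read_block_group` are hypothetical names, not the kernel's structures or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a block group read from disk. */
struct toy_block_group {
	uint64_t size;	/* total bytes in the block group */
	uint64_t used;	/* bytes already allocated */
	bool cached;	/* true once its free space is known */
};

/*
 * Called while reading block groups from disk. A block group with
 * used == size has no free space to discover, so it is marked cached
 * right away. Returns true if the caching thread still has work to do.
 */
static bool toy_read_block_group(struct toy_block_group *bg)
{
	assert(bg->used <= bg->size);

	if (bg->used == bg->size)
		bg->cached = true;

	return !bg->cached;
}
```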
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until it's
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
2007-04-02 10:50:19 -04:00
	wake_up(&cur_trans->commit_wait);
2008-11-17 21:02:50 -05:00
2011-04-11 17:25:13 -04:00
	spin_lock(&root->fs_info->trans_lock);
2011-04-11 15:45:29 -04:00
	list_del_init(&cur_trans->list);
2011-04-11 17:25:13 -04:00
	spin_unlock(&root->fs_info->trans_lock);
2007-03-25 11:35:08 -04:00
	put_transaction(cur_trans);
2007-03-22 15:59:16 -04:00
	put_transaction(cur_trans);
2007-08-29 15:47:34 -04:00
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g.:
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordered_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs-specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are critical resources and produce a lot of corner cases during writeback,
so it is valuable to know how a page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when an inode is created and when an inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In a transaction-based filesystem, it is useful to know the generation and
who does the commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references, and these tracepoints are helpful for
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
A chunk is a link between physical offset and logical offset, and stands for space
information in btrfs, so these are helpful for tracing space usage.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-24 11:18:59 +00:00
	trace_btrfs_transaction_commit(root);
2011-03-08 14:14:00 +01:00
	btrfs_scrub_continue(root);
2009-09-11 16:12:44 -04:00
	if (current->journal_info == trans)
		current->journal_info = NULL;
2007-04-02 10:50:19 -04:00
	kmem_cache_free(btrfs_trans_handle_cachep, trans);
2009-11-12 09:36:34 +00:00
	if (current != root->fs_info->transaction_kthread)
		btrfs_run_delayed_iputs(root);
2007-03-22 15:59:16 -04:00
	return ret;
}
2008-09-29 15:18:18 -04:00
/*
 * interface function to delete all the snapshots we have scheduled for deletion
 */
2007-08-10 14:06:19 -04:00
int btrfs_clean_old_snapshots(struct btrfs_root *root)
{
2009-06-10 10:45:14 -04:00
	LIST_HEAD(list);
	struct btrfs_fs_info *fs_info = root->fs_info;
2011-04-11 17:25:13 -04:00
	spin_lock(&fs_info->trans_lock);
2009-06-10 10:45:14 -04:00
	list_splice_init(&fs_info->dead_roots, &list);
2011-04-11 17:25:13 -04:00
	spin_unlock(&fs_info->trans_lock);
2007-08-10 14:06:19 -04:00
2009-06-10 10:45:14 -04:00
	while (!list_empty(&list)) {
		root = list_entry(list.next, struct btrfs_root, root_list);
2009-09-21 16:00:26 -04:00
		list_del(&root->root_list);
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which was spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix the deadlock between readdir() and memory fault, which was reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix the nested lock, which was reported by Itaru Kitayama, by updating the
space cache inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which was reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comments.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason.
Changelog V1 -> V2:
- Break up the global rb-tree; use a list to manage the delayed nodes,
which are created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- Introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so we made this patch, which implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- Introduce a delayed root object into the filesystem, which uses two lists to
manage the delayed nodes that are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. The
other is used to manage the delayed nodes that are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees: one is used to manage the directory name
index items that are going to be inserted into the b+ tree, and the other is
used to manage the directory name index items that are going to be deleted
from the b+ tree.
- Introduce a worker to deal with the delayed operations. This worker handles
the delayed directory name index item insertions and deletions and the
delayed inode updates.
When the number of delayed items is beyond the lower limit, we create works
for some delayed nodes and insert them into the work queue of the worker, and
then go back.
When the number of delayed items is beyond the upper bound, we create works
for all the delayed nodes that haven't been dealt with, insert them into the
work queue of the worker, and then wait until the number of untreated items is
below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add
the information into the delayed inserting rb-tree.
Then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search for
it in the inserting rb-tree first. If we find it, we just drop it. If not,
we add its key into the delayed deleting rb-tree.
As with the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit the transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
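The balance policy in the list above can be sketched as a small decision function. This is an illustration only: the `toy_` names and the two threshold values are assumptions made for the example, not taken from the patch.

```c
#include <assert.h>

/* Hypothetical thresholds; the message does not give the real values. */
#define TOY_LOWER_LIMIT	100
#define TOY_UPPER_BOUND	400

enum toy_balance_action {
	TOY_BALANCE_NONE,		/* below the lower limit: nothing to do */
	TOY_BALANCE_SOME,		/* queue works for some delayed nodes, go back */
	TOY_BALANCE_ALL_AND_WAIT	/* queue works for all nodes, then wait */
};

/*
 * Decide what the task that just added a delayed item should do,
 * given the current number of delayed items.
 */
static enum toy_balance_action toy_balance_delayed_items(int nr_items)
{
	assert(nr_items >= 0);

	if (nr_items > TOY_UPPER_BOUND)
		return TOY_BALANCE_ALL_AND_WAIT;
	if (nr_items > TOY_LOWER_LIMIT)
		return TOY_BALANCE_SOME;
	return TOY_BALANCE_NONE;
}
```

Only the caller that pushes the count past the upper bound pays the cost of waiting; everyone else returns immediately and leaves the work to the worker.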
I did a quick test with the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
		btrfs_kill_all_delayed_nodes(root);
2009-09-21 16:00:26 -04:00
		if (btrfs_header_backref_rev(root->node) <
		    BTRFS_MIXED_BACKREF_REV)
2011-09-12 15:26:38 +02:00
			btrfs_drop_snapshot(root, NULL, 0, 0);
2009-09-21 16:00:26 -04:00
		else
2011-09-12 15:26:38 +02:00
			btrfs_drop_snapshot(root, NULL, 1, 0);
2007-08-10 14:06:19 -04:00
	}
	return 0;
}