2007-06-12 17:07:21 +04:00
/*
 * Copyright (C) 2007 Oracle.  All rights reserved.
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public
 * License v2 as published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * General Public License for more details.
 *
 * You should have received a copy of the GNU General Public
 * License along with this program; if not, write to the
 * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
 * Boston, MA 02111-1307, USA.
 */
2008-01-08 23:46:30 +03:00
# ifndef __BTRFS_CTREE__
# define __BTRFS_CTREE__
2007-02-02 17:18:22 +03:00
2007-10-16 00:18:55 +04:00
# include <linux/mm.h>
2017-02-02 21:15:33 +03:00
# include <linux/sched/signal.h>
2007-10-16 00:18:55 +04:00
# include <linux/highmem.h>
2007-03-22 19:13:20 +03:00
# include <linux/fs.h>
2011-03-08 16:14:00 +03:00
# include <linux/rwsem.h>
2013-08-15 19:11:21 +04:00
# include <linux/semaphore.h>
2007-08-29 23:47:34 +04:00
# include <linux/completion.h>
2008-03-26 17:28:07 +03:00
# include <linux/backing-dev.h>
2008-07-17 20:53:50 +04:00
# include <linux/wait.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 11:04:11 +03:00
# include <linux/slab.h>
2011-01-12 12:30:42 +03:00
# include <linux/kobject.h>
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordered_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are critical resources and produce a lot of corner cases during writeback,
so it is valuable to know how a page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when an inode is created and when an inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In a transaction-based filesystem, it is useful to know the generation and
who does the commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references; these tracepoints help in
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
A chunk is a link between physical offset and logical offset and stands for space
information in btrfs; these tracepoints help in tracing space usage.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-24 14:18:59 +03:00
# include <trace/events/btrfs.h>
2007-10-16 00:14:27 +04:00
# include <asm/kmap_types.h>
2011-09-21 23:05:58 +04:00
# include <linux/pagemap.h>
2013-01-29 10:04:50 +04:00
# include <linux/btrfs.h>
2016-04-01 23:14:29 +03:00
# include <linux/btrfs_tree.h>
Btrfs: reclaim the reserved metadata space at background
Before this patch, a task had to reclaim the metadata space
by itself if the metadata space was not enough. When the task started
the space reclamation, all the other tasks which wanted to reserve
metadata space were blocked. In some cases, they would be blocked for
a long time, which made the performance fluctuate wildly.
So we introduce background metadata space reclamation: when the space
is about to be exhausted, we insert a reclaim work into the workqueue, and the
worker of the workqueue helps us reclaim the reserved space in the
background. This way, the tasks needn't reclaim the space by themselves in
most cases, and even if the tasks have to reclaim the space or are blocked
for the space reclamation, they will get enough space more quickly.
Here is my test result(Tested by compilebench):
Memory: 2GB
CPU: 2Cores * 1CPU
Partition: 40GB(SSD)
Test command:
# compilebench -D <mnt> -m
Without this patch:
intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
With this patch:
intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-05-14 04:29:04 +04:00
# include <linux/workqueue.h>
2014-09-23 09:40:08 +04:00
# include <linux/security.h>
2015-12-14 19:42:10 +03:00
# include <linux/sizes.h>
2016-09-01 06:55:33 +03:00
# include <linux/dynamic_debug.h>
2017-03-03 11:55:14 +03:00
# include <linux/refcount.h>
2008-01-25 00:13:08 +03:00
# include "extent_io.h"
2007-10-16 00:14:19 +04:00
# include "extent_map.h"
2008-06-12 00:50:36 +04:00
# include "async-thread.h"
2007-03-22 19:13:20 +03:00
2007-03-16 23:20:31 +03:00
struct btrfs_trans_handle ;
2007-03-22 22:59:16 +03:00
struct btrfs_transaction ;
2010-05-16 18:48:46 +04:00
struct btrfs_pending_snapshot ;
2007-05-02 23:53:43 +04:00
extern struct kmem_cache * btrfs_trans_handle_cachep ;
extern struct kmem_cache * btrfs_bit_radix_cachep ;
2007-04-02 18:50:19 +04:00
extern struct kmem_cache * btrfs_path_cachep ;
2011-01-29 01:05:48 +03:00
extern struct kmem_cache * btrfs_free_space_cachep ;
2008-07-17 20:53:50 +04:00
struct btrfs_ordered_sum ;
2007-03-16 23:20:31 +03:00
2013-10-09 20:00:56 +04:00
# ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
# define STATIC noinline
# else
# define STATIC static noinline
# endif
2013-02-20 04:55:13 +04:00
# define BTRFS_MAGIC 0x4D5F53665248425FULL /* ascii _BHRfS_M, no null */
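The comment on BTRFS_MAGIC says the constant spells "_BHRfS_M" in ASCII. A minimal user-space sketch of why that holds when the 64-bit value is stored little-endian, as `__le64` fields are; `magic_to_ascii()` is a hypothetical helper for illustration and is not part of the header:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BTRFS_MAGIC 0x4D5F53665248425FULL /* value copied from the define above */

/*
 * Write the eight bytes of magic into out[0..7], least significant byte
 * first (little-endian order), and NUL-terminate the result in out[8].
 * Hypothetical helper, for illustration only.
 */
static void magic_to_ascii(uint64_t magic, char out[9])
{
	for (int i = 0; i < 8; i++)
		out[i] = (char)((magic >> (8 * i)) & 0xff);
	out[8] = '\0';
}
```

Calling `magic_to_ascii(BTRFS_MAGIC, buf)` with a 9-byte buffer leaves the string "_BHRfS_M" in `buf`; the "no null" part of the comment refers to the on-disk field, which holds only the eight bytes.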
2007-02-02 17:18:22 +03:00
2012-11-06 17:57:46 +04:00
# define BTRFS_MAX_MIRRORS 3
2012-03-27 22:21:26 +04:00
2009-02-12 22:09:45 +03:00
# define BTRFS_MAX_LEVEL 8
2008-03-24 22:01:56 +03:00
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 18:45:14 +04:00
# define BTRFS_COMPAT_EXTENT_TREE_V0
2010-08-06 21:21:20 +04:00
/*
* the max metadata block size . This limit is somewhat artificial ,
* but the memmove costs go through the roof for larger blocks .
*/
# define BTRFS_MAX_METADATA_BLOCKSIZE 65536
2007-03-22 19:13:20 +03:00
/*
* we can actually store much bigger names , but let's not confuse the rest
* of linux
*/
# define BTRFS_NAME_LEN 255
2012-08-08 22:32:27 +04:00
/*
* Theoretical limit is larger , but we keep this down to a sane
* value . That should limit greatly the possibility of collisions on
* inode ref items .
*/
# define BTRFS_LINK_MAX 65535U
2015-11-19 13:42:31 +03:00
static const int btrfs_csum_sizes [ ] = { 4 } ;
2008-12-02 15:17:45 +03:00
2007-05-10 20:36:17 +04:00
/* four bytes for CRC32 */
2007-12-12 22:38:19 +03:00
# define BTRFS_EMPTY_DIR_SIZE 0
2007-03-29 23:15:27 +04:00
2012-02-03 14:20:04 +04:00
/* ioprio of readahead is set to idle */
# define BTRFS_IOPRIO_READA (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0))
2015-12-14 19:42:10 +03:00
# define BTRFS_DIRTY_METADATA_THRESH SZ_32M
2013-01-29 14:09:20 +04:00
2015-12-14 19:42:10 +03:00
# define BTRFS_MAX_EXTENT_SIZE SZ_128M
2015-02-11 23:08:59 +03:00
2017-01-04 13:09:51 +03:00
/*
* Count how many BTRFS_MAX_EXTENT_SIZE cover the @ size
*/
static inline u32 count_max_extents ( u64 size )
{
return div_u64 ( size + BTRFS_MAX_EXTENT_SIZE - 1 , BTRFS_MAX_EXTENT_SIZE ) ;
}
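count_max_extents() is a ceiling division: it rounds the size up to the next multiple of BTRFS_MAX_EXTENT_SIZE before dividing. A minimal user-space sketch of the same arithmetic, assuming SZ_128M is 128 MiB and using plain 64-bit division in place of the kernel's div_u64(); the names `MAX_EXTENT_SIZE` and `count_max_extents_sketch()` are illustrative only:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_EXTENT_SIZE (128ULL * 1024 * 1024) /* assumed value of SZ_128M */

/*
 * Ceiling division: the number of MAX_EXTENT_SIZE-sized extents needed
 * to cover size bytes. Adding (divisor - 1) before dividing rounds any
 * partial extent up to a whole one.
 */
static uint32_t count_max_extents_sketch(uint64_t size)
{
	return (uint32_t)((size + MAX_EXTENT_SIZE - 1) / MAX_EXTENT_SIZE);
}
```

A size of 0 yields 0 extents, any size from 1 byte up to 128 MiB yields 1, and 128 MiB plus one byte yields 2.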
2008-03-24 22:01:56 +03:00
struct btrfs_mapping_tree {
struct extent_map_tree map_tree ;
} ;
static inline unsigned long btrfs_chunk_item_size ( int num_stripes )
{
BUG_ON ( num_stripes == 0 ) ;
return sizeof ( struct btrfs_chunk ) +
sizeof ( struct btrfs_stripe ) * ( num_stripes - 1 ) ;
}
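btrfs_chunk_item_size() subtracts one stripe because the chunk structure already embeds its first stripe. A minimal sketch of that sizing rule, using hypothetical stand-in structs (`chunk_sketch`, `stripe_sketch`) whose fields are illustrative and do not reproduce the real on-disk layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a stripe record; fields are illustrative. */
struct stripe_sketch {
	uint64_t devid;
	uint64_t offset;
	uint8_t dev_uuid[16];
};

/* Hypothetical stand-in for a chunk header with one embedded stripe. */
struct chunk_sketch {
	uint64_t length;
	uint16_t num_stripes;
	struct stripe_sketch stripe; /* the first stripe lives in the header */
};

/*
 * Size of a chunk item holding num_stripes stripes: the header already
 * accounts for one stripe, so only (num_stripes - 1) more are added.
 */
static size_t chunk_item_size_sketch(int num_stripes)
{
	assert(num_stripes != 0); /* stands in for the kernel's BUG_ON() */
	return sizeof(struct chunk_sketch) +
	       sizeof(struct stripe_sketch) * (size_t)(num_stripes - 1);
}
```

With one stripe the item is exactly the size of the header; each further stripe adds one stripe record.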
2011-01-06 14:30:25 +03:00
/*
* File system states
*/
2013-01-29 14:14:48 +04:00
# define BTRFS_FS_STATE_ERROR 0
2013-02-21 10:32:52 +04:00
# define BTRFS_FS_STATE_REMOUNTING 1
2013-03-12 18:46:08 +04:00
# define BTRFS_FS_STATE_TRANS_ABORTED 2
Btrfs: fix use-after-free in the finishing procedure of the device replace
During device replace test, we hit a null pointer dereference (it was very easy
to reproduce it by running xfstests' btrfs/011 on the devices with the virtio
scsi driver). There were two bugs that caused this problem:
- We might allocate new chunks on the replaced device after we updated
the mapping tree. And we forgot to replace the source device in those
mapping of the new chunks.
- We might get the mapping information which includes the source device
before the mapping information update, and then submit the bio which was
based on that mapping information after we freed the source device.
For the first bug, we can fix it by doing mapping tree update and source
device remove in the same context of the chunk mutex. The chunk mutex is
used to protect the allocable device list, the above method can avoid
the new chunk allocation, and after we remove the source device, all
the new chunks will be allocated on the new device. So it can fix
the first bug.
For the second bug, we need to make sure all in-flight bios are finished and
no new bios are produced while we are removing the source device. To fix
this problem, we introduced a global @bio_counter; we not only inc/dec
@bio_counter outside of map_blocks, but also inc it before submitting a bio
and dec @bio_counter when ending bios.
Since Raid56 is a little different and device replace doesn't support raid56
yet, it is not addressed in the patch and I added comments to make sure we will
fix it in the future.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-01-30 12:46:55 +04:00
# define BTRFS_FS_STATE_DEV_REPLACING 3
2016-06-20 21:14:09 +03:00
# define BTRFS_FS_STATE_DUMMY_FS_INFO 4
2011-01-06 14:30:25 +03:00
2009-06-10 18:45:14 +04:00
# define BTRFS_BACKREF_REV_MAX 256
# define BTRFS_BACKREF_REV_SHIFT 56
# define BTRFS_BACKREF_REV_MASK (((u64)BTRFS_BACKREF_REV_MAX - 1) << \
BTRFS_BACKREF_REV_SHIFT )
# define BTRFS_OLD_BACKREF_REV 0
# define BTRFS_MIXED_BACKREF_REV 1
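The BTRFS_BACKREF_REV_* defines describe an 8-bit field (256 values) shifted up by 56 bits, that is, the top byte of a 64-bit word. A minimal sketch of reading and writing such a field, assuming it is packed into the header's `flags` word; `backref_rev_get()` and `backref_rev_set()` are hypothetical names for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define BACKREF_REV_MAX   256
#define BACKREF_REV_SHIFT 56
#define BACKREF_REV_MASK  (((uint64_t)BACKREF_REV_MAX - 1) << BACKREF_REV_SHIFT)

/* Extract the revision stored in the top 8 bits of flags. */
static int backref_rev_get(uint64_t flags)
{
	return (int)(flags >> BACKREF_REV_SHIFT);
}

/* Replace the revision bits, leaving the low 56 bits of flags untouched. */
static uint64_t backref_rev_set(uint64_t flags, int rev)
{
	flags &= ~BACKREF_REV_MASK;
	flags |= (uint64_t)rev << BACKREF_REV_SHIFT;
	return flags;
}
```

Setting revision 1 (the mixed back reference revision) on a flags word changes only the top byte, so the ordinary flag bits below bit 56 survive the update.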
2008-04-01 19:21:32 +04:00
2007-02-26 18:40:21 +03:00
/*
* every tree block ( leaf or node ) starts with this header .
*/
2007-03-12 19:29:44 +03:00
struct btrfs_header {
2008-04-15 23:41:47 +04:00
/* these first four must match the super block */
2007-03-29 23:15:27 +04:00
u8 csum [ BTRFS_CSUM_SIZE ] ;
2007-10-16 00:14:19 +04:00
u8 fsid [ BTRFS_FSID_SIZE ] ; /* FS specific uuid */
2007-10-16 00:15:53 +04:00
__le64 bytenr ; /* which block this node is supposed to live in */
2008-04-01 19:21:32 +04:00
__le64 flags ;
2008-04-15 23:41:47 +04:00
/* allowed to be different from the super from here on down */
u8 chunk_tree_uuid [ BTRFS_UUID_SIZE ] ;
2007-03-23 22:56:19 +03:00
__le64 generation ;
2007-04-21 04:23:12 +04:00
__le64 owner ;
2007-10-16 00:14:19 +04:00
__le32 nritems ;
2007-03-27 17:06:38 +04:00
u8 level ;
2007-02-02 17:18:22 +03:00
} __attribute__ ( ( __packed__ ) ) ;
2008-03-24 22:01:56 +03:00
/*
* this is a very generous portion of the super block , giving us
* room to translate 14 chunks with 3 stripes each .
*/
# define BTRFS_SYSTEM_CHUNK_ARRAY_SIZE 2048
2011-11-03 23:17:42 +04:00
/*
* just in case we somehow lose the roots and are not able to mount ,
* we store an array of the roots from previous transactions
* in the super .
*/
# define BTRFS_NUM_BACKUP_ROOTS 4
struct btrfs_root_backup {
__le64 tree_root ;
__le64 tree_root_gen ;
__le64 chunk_root ;
__le64 chunk_root_gen ;
__le64 extent_root ;
__le64 extent_root_gen ;
__le64 fs_root ;
__le64 fs_root_gen ;
__le64 dev_root ;
__le64 dev_root_gen ;
__le64 csum_root ;
__le64 csum_root_gen ;
__le64 total_bytes ;
__le64 bytes_used ;
__le64 num_devices ;
/* future */
2012-10-31 19:16:32 +04:00
__le64 unused_64 [ 4 ] ;
2011-11-03 23:17:42 +04:00
u8 tree_root_level ;
u8 chunk_root_level ;
u8 extent_root_level ;
u8 fs_root_level ;
u8 dev_root_level ;
u8 csum_root_level ;
/* future and to align */
u8 unused_8 [ 10 ] ;
} __attribute__ ( ( __packed__ ) ) ;
2007-02-26 18:40:21 +03:00
/*
* the super block basically lists the main trees of the FS
* it currently lacks any block count etc etc
*/
2007-03-13 17:46:10 +03:00
struct btrfs_super_block {
2007-03-29 23:15:27 +04:00
u8 csum [ BTRFS_CSUM_SIZE ] ;
2008-04-01 19:21:32 +04:00
/* the first 4 fields must match struct btrfs_header */
2008-11-18 05:11:30 +03:00
u8 fsid [ BTRFS_FSID_SIZE ] ; /* FS specific uuid */
2007-10-16 00:15:53 +04:00
__le64 bytenr ; /* this block number */
2008-04-01 19:21:32 +04:00
__le64 flags ;
2008-04-15 23:41:47 +04:00
/* allowed to be different from the btrfs_header from here on down */
2007-03-13 23:47:54 +03:00
__le64 magic ;
__le64 generation ;
__le64 root ;
2008-03-24 22:01:56 +03:00
__le64 chunk_root ;
2008-09-06 00:13:11 +04:00
__le64 log_root ;
2008-12-09 00:40:21 +03:00
/* this will help find the new super based on the log root */
__le64 log_root_transid ;
2007-10-16 00:15:53 +04:00
__le64 total_bytes ;
__le64 bytes_used ;
2007-03-21 18:12:56 +03:00
__le64 root_dir_objectid ;
2008-03-24 22:02:07 +03:00
__le64 num_devices ;
2007-10-16 00:14:19 +04:00
__le32 sectorsize ;
__le32 nodesize ;
2014-06-04 21:22:26 +04:00
__le32 __unused_leafsize ;
2007-11-30 19:30:34 +03:00
__le32 stripesize ;
2008-03-24 22:01:56 +03:00
__le32 sys_chunk_array_size ;
2008-10-29 21:49:05 +03:00
__le64 chunk_root_generation ;
2008-12-02 14:36:08 +03:00
__le64 compat_flags ;
__le64 compat_ro_flags ;
__le64 incompat_flags ;
2008-12-02 15:17:45 +03:00
__le16 csum_type ;
2007-10-16 00:15:53 +04:00
u8 root_level ;
2008-03-24 22:01:56 +03:00
u8 chunk_root_level ;
2008-09-06 00:13:11 +04:00
u8 log_root_level ;
2008-03-24 22:02:07 +03:00
struct btrfs_dev_item dev_item ;
2008-12-09 00:40:21 +03:00
2008-04-18 18:29:49 +04:00
char label [ BTRFS_LABEL_SIZE ] ;
2008-12-09 00:40:21 +03:00
2010-06-21 22:48:16 +04:00
__le64 cache_generation ;
2013-08-15 19:11:22 +04:00
__le64 uuid_tree_generation ;
2010-06-21 22:48:16 +04:00
2008-12-09 00:40:21 +03:00
/* future expansion */
2013-08-15 19:11:22 +04:00
__le64 reserved [ 30 ] ;
2008-03-24 22:01:56 +03:00
u8 sys_chunk_array [ BTRFS_SYSTEM_CHUNK_ARRAY_SIZE ] ;
2011-11-03 23:17:42 +04:00
struct btrfs_root_backup super_roots [ BTRFS_NUM_BACKUP_ROOTS ] ;
2007-02-22 01:04:57 +03:00
} __attribute__ ( ( __packed__ ) ) ;
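The comments in both structures require the first four fields of the super block to match the tree block header. A minimal sketch of how that invariant could be checked with offsetof(), using trimmed stand-in structs; the sizes 32 and 16 for the checksum and fsid arrays are assumptions here, since BTRFS_CSUM_SIZE and BTRFS_FSID_SIZE are defined outside this section, and `__attribute__((__packed__))` is the GCC/Clang extension the header itself uses:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CSUM_SIZE 32 /* assumed value of BTRFS_CSUM_SIZE */
#define FSID_SIZE 16 /* assumed value of BTRFS_FSID_SIZE */

/* Trimmed stand-in for the tree block header: shared prefix plus one field. */
struct header_sketch {
	uint8_t csum[CSUM_SIZE];
	uint8_t fsid[FSID_SIZE];
	uint64_t bytenr;
	uint64_t flags;
	uint8_t chunk_tree_uuid[16];
} __attribute__((__packed__));

/* Trimmed stand-in for the super block: same prefix, then it diverges. */
struct super_sketch {
	uint8_t csum[CSUM_SIZE];
	uint8_t fsid[FSID_SIZE];
	uint64_t bytenr;
	uint64_t flags;
	uint64_t magic;
} __attribute__((__packed__));

#define SAME_OFFSET(field) \
	(offsetof(struct header_sketch, field) == \
	 offsetof(struct super_sketch, field))

/* Returns nonzero when the four shared fields sit at identical offsets. */
static int shared_prefix_matches(void)
{
	return SAME_OFFSET(csum) && SAME_OFFSET(fsid) &&
	       SAME_OFFSET(bytenr) && SAME_OFFSET(flags);
}
```

Because both structs are packed, the shared prefix occupies the same 64 bytes in each, so code that only needs the checksum, fsid, block number, or flags can read either kind of block the same way.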
2008-12-02 14:36:08 +03:00
/*
* Compat flags that we support . If any incompat flags are set other than the
* ones specified below then we will fail to mount
*/
2009-06-10 18:45:14 +04:00
# define BTRFS_FEATURE_COMPAT_SUPP 0ULL
2013-11-16 00:33:55 +04:00
# define BTRFS_FEATURE_COMPAT_SAFE_SET 0ULL
# define BTRFS_FEATURE_COMPAT_SAFE_CLEAR 0ULL
2015-09-30 06:50:38 +03:00
# define BTRFS_FEATURE_COMPAT_RO_SUPP \
2016-09-23 03:24:22 +03:00
( BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE | \
BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID )
2015-09-30 06:50:38 +03:00
2013-11-16 00:33:55 +04:00
# define BTRFS_FEATURE_COMPAT_RO_SAFE_SET 0ULL
# define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR 0ULL
2010-06-21 22:48:16 +04:00
# define BTRFS_FEATURE_INCOMPAT_SUPP \
( BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF | \
2010-09-17 00:19:09 +04:00
BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL | \
2010-10-25 11:12:26 +04:00
BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS | \
2010-08-06 21:21:20 +04:00
BTRFS_FEATURE_INCOMPAT_BIG_METADATA | \
2012-08-08 22:32:27 +04:00
BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO | \
btrfs: Add zstd support
Add zstd compression and decompression support to BtrFS. zstd at its
fastest level compresses almost as well as zlib, while offering much
faster compression and decompression, approaching lzo speeds.
I benchmarked btrfs with zstd compression against no compression, lzo
compression, and zlib compression. I benchmarked two scenarios. Copying
a set of files to btrfs, and then reading the files. Copying a tarball
to btrfs, extracting it to btrfs, and then reading the extracted files.
After every operation, I call `sync` and include the sync time.
Between every pair of operations I unmount and remount the filesystem
to avoid caching. The benchmark files can be found in the upstream
zstd source repository under
`contrib/linux-kernel/{btrfs-benchmark.sh,btrfs-extract-benchmark.sh}`
[1] [2].
I ran the benchmarks on an Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and an SSD.
The first compression benchmark is copying 10 copies of the unzipped
Silesia corpus [3] into a BtrFS filesystem mounted with
`-o compress-force=Method`. The decompression benchmark times how long
it takes to `tar` all 10 copies into `/dev/null`. The compression ratio is
measured by comparing the output of `df` and `du`. See the benchmark file
[1] for details. I benchmarked multiple zstd compression levels, although
the patch uses zstd level 1.
| Method | Ratio | Compression MB/s | Decompression MB/s |
|---------|-------|------------------|---------------------|
| None | 0.99 | 504 | 686 |
| lzo | 1.66 | 398 | 442 |
| zlib | 2.58 | 65 | 241 |
| zstd 1 | 2.57 | 260 | 383 |
| zstd 3 | 2.71 | 174 | 408 |
| zstd 6 | 2.87 | 70 | 398 |
| zstd 9 | 2.92 | 43 | 406 |
| zstd 12 | 2.93 | 21 | 408 |
| zstd 15 | 3.01 | 11 | 354 |
The next benchmark first copies `linux-4.11.6.tar` [4] to btrfs. Then it
measures the compression ratio, extracts the tar, and deletes the tar.
Then it measures the compression ratio again, and `tar`s the extracted
files into `/dev/null`. See the benchmark file [2] for details.
| Method | Tar Ratio | Extract Ratio | Copy (s) | Extract (s)| Read (s) |
|--------|-----------|---------------|----------|------------|----------|
| None | 0.97 | 0.78 | 0.981 | 5.501 | 8.807 |
| lzo | 2.06 | 1.38 | 1.631 | 8.458 | 8.585 |
| zlib | 3.40 | 1.86 | 7.750 | 21.544 | 11.744 |
| zstd 1 | 3.57 | 1.85 | 2.579 | 11.479 | 9.389 |
[1] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-benchmark.sh
[2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-extract-benchmark.sh
[3] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
[4] https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.11.6.tar.xz
zstd source repository: https://github.com/facebook/zstd
Signed-off-by: Nick Terrell <terrelln@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-08-10 05:39:02 +03:00
BTRFS_FEATURE_INCOMPAT_COMPRESS_ZSTD | \
2013-01-30 03:40:14 +04:00
BTRFS_FEATURE_INCOMPAT_RAID56 | \
2013-03-07 23:22:04 +04:00
BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF | \
2013-10-22 20:18:51 +04:00
BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA | \
BTRFS_FEATURE_INCOMPAT_NO_HOLES )
2008-12-02 14:36:08 +03:00
2013-11-16 00:33:55 +04:00
# define BTRFS_FEATURE_INCOMPAT_SAFE_SET \
( BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF )
# define BTRFS_FEATURE_INCOMPAT_SAFE_CLEAR 0ULL
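A minimal sketch of how a supported-feature mask such as the ones above is typically consumed: any incompat bit found on disk that is missing from the mask means the filesystem must not be mounted. The `DEMO_*` names, the bit positions, and both helper functions are assumptions made for illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values, assumed for this sketch only. */
#define DEMO_INCOMPAT_MIXED_BACKREF  (1ULL << 0)
#define DEMO_INCOMPAT_DEFAULT_SUBVOL (1ULL << 1)
#define DEMO_INCOMPAT_COMPRESS_ZSTD  (1ULL << 4)
#define DEMO_INCOMPAT_FUTURE         (1ULL << 40) /* hypothetical unknown feature */

#define DEMO_INCOMPAT_SUPP \
	(DEMO_INCOMPAT_MIXED_BACKREF | \
	 DEMO_INCOMPAT_DEFAULT_SUBVOL | \
	 DEMO_INCOMPAT_COMPRESS_ZSTD)

/* Return the on-disk incompat bits that this implementation does not know. */
static inline uint64_t demo_unsupported_incompat(uint64_t disk_flags)
{
	return disk_flags & ~DEMO_INCOMPAT_SUPP;
}

/* A filesystem is mountable only when no unknown incompat bit is set. */
static inline int demo_can_mount(uint64_t disk_flags)
{
	uint64_t unknown = demo_unsupported_incompat(disk_flags);

	/* By construction, no supported bit can appear in the unknown set. */
	assert((unknown & DEMO_INCOMPAT_SUPP) == 0);
	return unknown == 0;
}
```

Masking with the complement of the supported set keeps the check independent of how many feature bits exist: adding a feature only requires extending the mask.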
2008-12-02 14:36:08 +03:00
2007-02-26 18:40:21 +03:00
/*
2007-03-15 19:56:47 +03:00
* A leaf is full of items . offset and size tell us where to find
2007-02-26 18:40:21 +03:00
* the item in the leaf ( relative to the start of the data area )
*/
2007-03-13 03:12:07 +03:00
struct btrfs_item {
2007-03-12 23:22:34 +03:00
struct btrfs_disk_key key ;
2007-03-14 21:14:43 +03:00
__le32 offset ;
2007-10-16 00:14:19 +04:00
__le32 size ;
2007-02-02 17:18:22 +03:00
} __attribute__ ( ( __packed__ ) ) ;
2007-02-26 18:40:21 +03:00
/*
* leaves have an item area and a data area :
* [ item0 , item1 . . . . itemN ] [ free space ] [ dataN . . . data1 , data0 ]
*
* The data is separate from the items to get the keys closer together
* during searches .
*/
2007-03-13 17:46:10 +03:00
struct btrfs_leaf {
2007-03-12 19:29:44 +03:00
struct btrfs_header header ;
2007-03-14 21:14:43 +03:00
struct btrfs_item items [ ] ;
2007-02-02 17:18:22 +03:00
} __attribute__ ( ( __packed__ ) ) ;
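The leaf layout described in the comments above (item headers growing forward, payloads growing backward, free space in between) can be sketched with a toy user-space structure. Everything named `demo_*`, the 256-byte block size, and the single `uint64_t` key are assumptions for illustration; the real structures are packed, little-endian, and use `struct btrfs_disk_key`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_LEAF_SIZE 256u /* toy block size, assumed for this sketch */

struct demo_item {
	uint64_t key;    /* stand-in for struct btrfs_disk_key */
	uint32_t offset; /* start of the payload, relative to the data area */
	uint32_t size;   /* payload length in bytes */
};

struct demo_leaf {
	uint32_t nritems;
	union {
		/* Item headers fill the area from the front... */
		struct demo_item items[DEMO_LEAF_SIZE / sizeof(struct demo_item)];
		/* ...while payload bytes fill the same area from the back. */
		uint8_t data[DEMO_LEAF_SIZE];
	} u;
};

/* Offset of the lowest payload byte; the full area size when the leaf is empty. */
static inline uint32_t demo_data_end(const struct demo_leaf *leaf)
{
	if (leaf->nritems == 0)
		return DEMO_LEAF_SIZE;
	return leaf->u.items[leaf->nritems - 1].offset;
}

/* Bytes left between the end of the item array and the first payload. */
static inline uint32_t demo_free_space(const struct demo_leaf *leaf)
{
	uint32_t items_end = leaf->nritems * (uint32_t)sizeof(struct demo_item);

	return demo_data_end(leaf) - items_end;
}

/* Append an item; keys are assumed to arrive already sorted. Returns 0 or -1. */
static inline int demo_leaf_append(struct demo_leaf *leaf, uint64_t key,
				   const void *payload, uint32_t size)
{
	struct demo_item *item;
	uint32_t offset;

	if (demo_free_space(leaf) < sizeof(struct demo_item) + size)
		return -1;

	offset = demo_data_end(leaf) - size;
	memcpy(leaf->u.data + offset, payload, size);

	item = &leaf->u.items[leaf->nritems];
	item->key = key;
	item->offset = offset;
	item->size = size;
	leaf->nritems++;

	/* The two regions must never overlap. */
	assert(demo_data_end(leaf) >= leaf->nritems * sizeof(struct demo_item));
	return 0;
}
```

Because the keys sit contiguously at the front of the block, a search touches only the item array and never the payloads, which is the locality benefit the comment describes.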
2007-02-26 18:40:21 +03:00
/*
* all non - leaf blocks are nodes , they hold only keys and pointers to
* other blocks
*/
2007-03-14 21:14:43 +03:00
struct btrfs_key_ptr {
struct btrfs_disk_key key ;
__le64 blockptr ;
2007-12-11 17:25:06 +03:00
__le64 generation ;
2007-03-14 21:14:43 +03:00
} __attribute__ ( ( __packed__ ) ) ;
2007-03-13 17:46:10 +03:00
struct btrfs_node {
2007-03-12 19:29:44 +03:00
struct btrfs_header header ;
2007-03-14 21:14:43 +03:00
struct btrfs_key_ptr ptrs [ ] ;
2007-02-02 17:18:22 +03:00
} __attribute__ ( ( __packed__ ) ) ;
2007-02-26 18:40:21 +03:00
/*
2007-03-13 17:46:10 +03:00
* btrfs_paths remember the path taken from the root down to the leaf .
* level 0 is always the leaf , and nodes [ 1. . . BTRFS_MAX_LEVEL ] will point
2007-02-26 18:40:21 +03:00
* to any other levels that are present .
*
* The slots array records the index of the item or block pointer
* used while walking the tree .
*/
2015-11-27 18:31:35 +03:00
enum { READA_NONE = 0 , READA_BACK , READA_FORWARD } ;
2007-03-13 17:46:10 +03:00
struct btrfs_path {
2007-10-16 00:14:19 +04:00
struct extent_buffer * nodes [ BTRFS_MAX_LEVEL ] ;
2007-03-13 17:46:10 +03:00
int slots [ BTRFS_MAX_LEVEL ] ;
2008-06-26 00:01:30 +04:00
/* if there is real range locking, this locks field will change */
2015-11-27 18:31:45 +03:00
u8 locks [ BTRFS_MAX_LEVEL ] ;
2015-11-27 18:31:38 +03:00
u8 reada ;
2008-06-26 00:01:30 +04:00
/* keep some upper locks as we walk down */
2015-11-27 18:31:42 +03:00
u8 lowest_level ;
2008-12-10 17:10:46 +03:00
/*
* set by btrfs_split_item , tells search_slot to keep all locks
* and to force calls to keep space in the nodes
*/
2009-03-13 18:00:37 +03:00
unsigned int search_for_split : 1 ;
unsigned int keep_locks : 1 ;
unsigned int skip_locking : 1 ;
unsigned int leave_spinning : 1 ;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 18:45:14 +04:00
unsigned int search_commit_root : 1 ;
2014-03-29 01:16:01 +04:00
unsigned int need_commit_sem : 1 ;
2014-11-09 11:38:39 +03:00
unsigned int skip_release_on_error : 1 ;
2007-02-02 17:18:22 +03:00
} ;
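As a rough illustration of what the `nodes` and `slots` arrays in `struct btrfs_path` record, the sketch below walks a toy in-memory tree from the root to a leaf and stores the block and slot visited at each level. The `demo_*` types, the fan-out of 4, and the return convention are assumptions for this example; locking, read-ahead, and extent buffers are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_LEVEL 8
#define DEMO_FANOUT    4

struct demo_block {
	int level;                                /* 0 means leaf */
	int nritems;
	uint64_t keys[DEMO_FANOUT];               /* sorted ascending */
	struct demo_block *children[DEMO_FANOUT]; /* unused in leaves */
};

struct demo_path {
	struct demo_block *nodes[DEMO_MAX_LEVEL];
	int slots[DEMO_MAX_LEVEL];
};

/* Index of the last key <= target, or 0 when every key is larger. */
static inline int demo_find_slot(const struct demo_block *b, uint64_t target)
{
	int slot = 0;

	for (int i = 0; i < b->nritems; i++)
		if (b->keys[i] <= target)
			slot = i;
	return slot;
}

/*
 * Walk from the root down to a leaf, recording the block and slot used at
 * each level. Returns 1 on an exact key match in the leaf, 0 otherwise.
 */
static inline int demo_search(struct demo_block *root, uint64_t target,
			      struct demo_path *path)
{
	struct demo_block *b = root;

	assert(root->level < DEMO_MAX_LEVEL);
	for (int i = 0; i < DEMO_MAX_LEVEL; i++) {
		path->nodes[i] = NULL;
		path->slots[i] = 0;
	}

	for (;;) {
		int slot = demo_find_slot(b, target);

		path->nodes[b->level] = b;
		path->slots[b->level] = slot;
		if (b->level == 0)
			return b->nritems > 0 && b->keys[slot] == target;
		b = b->children[slot];
	}
}
```

After a search, `path->nodes[0]` is the leaf and higher indices hold the ancestors, so a caller can move to a neighboring item or split a block without repeating the descent.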
2016-06-15 16:22:56 +03:00
# define BTRFS_MAX_EXTENT_ITEM_SIZE(r) ((BTRFS_LEAF_DATA_SIZE(r->fs_info) >> 4) - \
2009-06-10 18:45:14 +04:00
sizeof ( struct btrfs_item ) )
2012-11-05 20:26:40 +04:00
struct btrfs_dev_replace {
u64 replace_state ; /* see #define above */
u64 time_started ; /* seconds since 1-Jan-1970 */
u64 time_stopped ; /* seconds since 1-Jan-1970 */
atomic64_t num_write_errors ;
atomic64_t num_uncorrectable_read_errors ;
u64 cursor_left ;
u64 committed_cursor_left ;
u64 cursor_left_last_write_of_item ;
u64 cursor_right ;
u64 cont_reading_from_srcdev_mode ; /* see #define above */
int is_valid ;
int item_needs_writeback ;
struct btrfs_device * srcdev ;
struct btrfs_device * tgtdev ;
pid_t lock_owner ;
atomic_t nesting_level ;
struct mutex lock_finishing_cancel_unmount ;
Btrfs: fix lockdep deadlock warning due to dev_replace
Xfstests btrfs/011 complains about a deadlock warning,
[ 1226.649039] =========================================================
[ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
[ 1226.649039] 4.1.0+ #270 Not tainted
[ 1226.649039] ---------------------------------------------------------
[ 1226.652955] kswapd0/46 just changed the state of lock:
[ 1226.652955] (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1226.652955] (&fs_info->dev_replace.lock){+.+.+.}
and interrupts could create inverse lock ordering between them.
[ 1226.652955]
other info that might help us debug this:
[ 1226.652955] Chain exists of:
&delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
[ 1226.652955] Possible interrupt unsafe locking scenario:
[ 1226.652955] CPU0 CPU1
[ 1226.652955] ---- ----
[ 1226.652955] lock(&fs_info->dev_replace.lock);
[ 1226.652955] local_irq_disable();
[ 1226.652955] lock(&delayed_node->mutex);
[ 1226.652955] lock(&found->groups_sem);
[ 1226.652955] <Interrupt>
[ 1226.652955] lock(&delayed_node->mutex);
[ 1226.652955]
*** DEADLOCK ***
Commit 084b6e7c7607 ("btrfs: Fix a lockdep warning when running xfstest.") tried
to fix a similar problem that produced exactly the same warning, but even with
that fix we still run into this one.
The above lock chain comes from
btrfs_commit_transaction
->btrfs_run_delayed_items
...
->__btrfs_update_delayed_inode
...
->__btrfs_cow_block
...
->find_free_extent
->cache_block_group
->load_free_space_cache
->btrfs_readpages
->submit_one_bio
...
->__btrfs_map_block
->btrfs_dev_replace_lock
However, under high memory pressure, a task that holds dev_replace.lock can
be interrupted by kswapd, which then tries to release memory occupied
by superblocks, inodes and dentries. That may call evict_inode, which ends up
in
[ 1226.652955] [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
[ 1226.652955] [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700
delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads
to an ABBA deadlock.
To fix this, we can use the "blocking rwlock" scheme used for extent_buffer, but
things are simpler here since we only need to turn the read side's spinlock into
a blocking lock.
With this, btrfs/011 no longer produces warnings in dmesg.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-07-17 11:49:19 +03:00
rwlock_t lock ;
atomic_t read_locks ;
atomic_t blocking_readers ;
wait_queue_head_t read_lock_wq ;
2012-11-05 20:26:40 +04:00
struct btrfs_scrub_progress scrub_progress ;
} ;
2014-05-27 20:59:57 +04:00
/* For raid type sysfs entries */
struct raid_kobject {
int raid_type ;
struct kobject kobj ;
} ;
2008-03-24 22:01:59 +03:00
struct btrfs_space_info {
2014-01-15 16:00:54 +04:00
spinlock_t lock ;
2009-02-20 19:00:09 +03:00
2010-10-14 22:52:27 +04:00
u64 total_bytes ; /* total bytes in the space,
this doesn ' t take mirrors into account */
2010-05-16 18:46:24 +04:00
u64 bytes_used ; /* total bytes used,
2011-04-27 10:28:26 +04:00
this doesn ' t take mirrors into account */
2009-02-20 19:00:09 +03:00
u64 bytes_pinned ; /* total bytes pinned, will be freed when the
transaction finishes */
u64 bytes_reserved ; /* total bytes the allocator has reserved for
current allocations */
u64 bytes_may_use ; /* number of bytes that may be used for
2009-09-12 00:12:44 +04:00
delalloc / allocations */
2014-01-15 16:00:54 +04:00
u64 bytes_readonly ; /* total bytes that are read only */
2015-09-29 18:40:47 +03:00
u64 max_extent_size ; /* This will hold the maximum extent size of
the space info if we had an ENOSPC in the
allocator . */
2014-01-15 16:00:54 +04:00
unsigned int full : 1 ; /* indicates that we cannot allocate any more
chunks for this space */
unsigned int chunk_alloc : 1 ; /* set if we are allocating a chunk */
unsigned int flush : 1 ; /* set if we are trying to make space */
unsigned int force_alloc ; /* set if we need to force a chunk
alloc for this space */
2010-05-16 18:46:24 +04:00
u64 disk_used ; /* total bytes used on disk */
2010-10-14 22:52:27 +04:00
u64 disk_total ; /* total bytes on disk, takes mirrors into
account */
2009-02-20 19:00:09 +03:00
2014-01-15 16:00:54 +04:00
u64 flags ;
2013-06-19 23:00:04 +04:00
/*
* bytes_pinned is kept in line with what is actually pinned , as in
* we ' ve called update_block_group and dropped the bytes_used counter
* and increased the bytes_pinned counter . However this means that
* bytes_pinned does not reflect the bytes that will be pinned once the
2016-03-04 22:23:12 +03:00
* delayed refs are flushed , so this counter is inc ' ed every time we
* call btrfs_free_extent so it is a realtime count of what will be
2016-05-20 04:18:45 +03:00
* freed once the transaction is committed . It will be zeroed every
2016-03-04 22:23:12 +03:00
* time the transaction commits .
2013-06-19 23:00:04 +04:00
*/
struct percpu_counter total_bytes_pinned ;
2008-03-24 22:01:59 +03:00
struct list_head list ;
2015-01-16 16:24:40 +03:00
/* Protected by the spinlock 'lock'. */
2014-10-31 16:49:34 +03:00
struct list_head ro_bgs ;
2016-05-17 20:30:55 +03:00
struct list_head priority_tickets ;
struct list_head tickets ;
2016-11-07 10:59:16 +03:00
/*
* tickets_id just indicates the next ticket will be handled , so note
* it ' s not stored per ticket .
*/
btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress
In btrfs_async_reclaim_metadata_space(), we use ticket's address to
determine whether asynchronous metadata reclaim work is making progress.
ticket = list_first_entry(&space_info->tickets,
struct reserve_ticket, list);
if (last_ticket == ticket) {
flush_state++;
} else {
last_ticket = ticket;
flush_state = FLUSH_DELAYED_ITEMS_NR;
if (commit_cycles)
commit_cycles--;
}
But this is wrong: we should not rely on a local variable's address for
this check, because addresses may repeat. In my test environment, I
dd one 168MB file in a 256MB fs and found that, for this file, every time
wait_reserve_ticket() is called the local variable ticket has the same address.
For the code above, assume a previous ticket's address is addrA, so last_ticket
is addrA. btrfs_async_reclaim_metadata_space() finishes this ticket and
wakes it up, then another ticket is added with the same address addrA.
Now last_ticket equals the current ticket, so the current ticket's flush
work starts from the current flush_state instead of the initial
FLUSH_DELAYED_ITEMS_NR, which may result in ENOSPC issues (I have seen this on
my test machine).
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-02 05:58:46 +03:00
u64 tickets_id ;
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree that tracks the block group cache via offset. Also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so every time we did a data allocation we'd search
through every block group that we have looking for free space. While searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degradation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and makes a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 21:14:11 +04:00
2014-01-15 16:00:54 +04:00
struct rw_semaphore groups_sem ;
2008-09-23 21:14:11 +04:00
/* for block groups in our same type */
2010-05-16 18:46:24 +04:00
struct list_head block_groups [ BTRFS_NR_RAID_TYPES ] ;
2011-06-08 00:07:44 +04:00
wait_queue_head_t wait ;
2013-11-01 21:07:04 +04:00
struct kobject kobj ;
2014-05-27 20:59:57 +04:00
struct kobject * block_group_kobjs [ BTRFS_NR_RAID_TYPES ] ;
2008-09-23 21:14:11 +04:00
} ;
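The reasoning behind the `tickets_id` field in `struct btrfs_space_info` can be illustrated with a small model: instead of comparing ticket addresses, which may be reused, the reclaim loop compares a counter that only moves when a ticket is actually satisfied. The `demo_*` names, the flush-state numbering, and the cap on escalation are assumptions for this sketch and do not reproduce the kernel's reclaim code.

```c
#include <assert.h>
#include <stdint.h>

enum demo_flush_state {
	DEMO_FLUSH_FIRST = 1, /* stand-in for FLUSH_DELAYED_ITEMS_NR */
	DEMO_FLUSH_LAST = 4   /* assumed number of escalation steps */
};

struct demo_space_info {
	uint64_t tickets_id; /* bumped each time a ticket is satisfied */
};

struct demo_reclaim {
	uint64_t last_tickets_id;
	int flush_state;
};

/* Called when a waiting ticket is satisfied and its waiter is woken. */
static inline void demo_ticket_done(struct demo_space_info *si)
{
	si->tickets_id++;
}

/*
 * One pass of the reclaim loop: escalate the flush state only if no ticket
 * finished since the previous pass; otherwise restart from the first state.
 */
static inline void demo_reclaim_step(struct demo_reclaim *r,
				     const struct demo_space_info *si)
{
	if (r->last_tickets_id == si->tickets_id) {
		if (r->flush_state < DEMO_FLUSH_LAST)
			r->flush_state++;
	} else {
		r->last_tickets_id = si->tickets_id;
		r->flush_state = DEMO_FLUSH_FIRST;
	}
	assert(r->flush_state >= DEMO_FLUSH_FIRST &&
	       r->flush_state <= DEMO_FLUSH_LAST);
}
```

A counter cannot collide the way a stack address can: even if the next ticket lands at the same address as the previous one, the counter has advanced, so the loop correctly restarts from the gentlest flush state.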
2012-09-06 14:02:28 +04:00
# define BTRFS_BLOCK_RSV_GLOBAL 1
# define BTRFS_BLOCK_RSV_DELALLOC 2
# define BTRFS_BLOCK_RSV_TRANS 3
# define BTRFS_BLOCK_RSV_CHUNK 4
# define BTRFS_BLOCK_RSV_DELOPS 5
# define BTRFS_BLOCK_RSV_EMPTY 6
# define BTRFS_BLOCK_RSV_TEMP 7
2010-05-16 18:46:25 +04:00
struct btrfs_block_rsv {
u64 size ;
u64 reserved ;
struct btrfs_space_info * space_info ;
spinlock_t lock ;
2012-09-06 14:02:28 +04:00
unsigned short full ;
unsigned short type ;
unsigned short failfast ;
2010-05-16 18:46:25 +04:00
} ;
2009-04-03 17:47:43 +04:00
/*
* free clusters are used to claim free space in relatively large chunks ,
btrfs: Do not use data_alloc_cluster in ssd mode
This patch provides a band aid to improve the 'out of the box'
behaviour of btrfs for disks that are detected as being an ssd. In a
general purpose mixed workload scenario, the current ssd mode causes
overallocation of available raw disk space for data, while leaving
behind increasing amounts of unused fragmented free space. This
situation leads to early ENOSPC problems which are harming user
experience and adoption of btrfs as a general purpose filesystem.
This patch modifies the data extent allocation behaviour of the ssd mode
to make it behave identical to nossd mode. The metadata behaviour and
additional ssd_spread option stay untouched so far.
Recommendations for future development are to reconsider the current
oversimplified nossd / ssd distinction and the broken detection
mechanism based on the rotational attribute in sysfs and provide
experienced users with a more flexible way to choose allocator behaviour
for data and metadata, optimized for certain use cases, while keeping
sane 'out of the box' default settings. The internals of the current
btrfs code have more potential than what currently gets exposed to the
user to choose from.
The SSD story...
In the first year of btrfs development, around early 2008, btrfs
gained a mount option which enables specific functionality for
filesystems on solid state devices. The first occurrence of this
functionality is in commit e18e4809, labeled "Add mount -o ssd, which
includes optimizations for seek free storage".
The effect on allocating free space for doing (data) writes is to
'cluster' writes together, writing them out in contiguous space, as
opposed to a 'tetris' way of putting all separate writes into any free
space fragment that fits (which is what the -o nossd behaviour does).
A somewhat simplified explanation of what happens is that, when for
example, the 'cluster' size is set to 2MiB, when we do some writes, the
data allocator will search for a free space block that is 2MiB big, and
put the writes in there. The ssd mode itself might allow a 2MiB cluster
to be composed of multiple free space extents with some existing data in
between, while the additional ssd_spread mount option kills off this
option and requires fully free space.
The idea behind this is (commit 536ac8ae): "The [...] clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
block. So, effectively this means applying a "locality based algorithm"
and trying to outsmart the actual ssd.
Since then, various changes have been made to the involved code, but the
basic idea is still present, and gets activated whenever the ssd mount
option is active. This also happens by default, when the rotational flag
as seen at /sys/block/<device>/queue/rotational is set to 0.
However, there's a number of problems with this approach.
First, what the optimization is trying to do is outsmart the ssd by
assuming there is a relation between the physical address space of the
block device as seen by btrfs and the actual physical storage of the
ssd, and then adjusting data placement. However, since the introduction
of the Flash Translation Layer (FTL) which is a part of the internal
controller of an ssd, these attempts are futile. The use of good quality
FTL in consumer ssd products might have been limited in 2008, but this
situation has changed drastically soon after that time. Today, even the
flash memory in your automatic cat feeding machine or your grandma's
wheelchair has a full featured one.
Second, the behaviour as described above results in the filesystem being
filled up with badly fragmented free space extents because of relatively
small pieces of space that are freed up by deletes, but not selected
again as part of a 'cluster'. Since the algorithm prefers allocating a
new chunk over going back to tetris mode, the end result is a filesystem
in which all raw space is allocated, but which is composed of
underutilized chunks with a 'shotgun blast' pattern of fragmented free
space. Usually, the next problematic thing that happens is the
filesystem wanting to allocate new space for metadata, which causes the
filesystem to fail in spectacular ways.
Third, the default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with spreading out writes over
the full address space and ignoring freed up space leads to worst case
behaviour in providing information to the ssd itself, since it will
never learn that all the free space left behind is actually free. There
are two ways to let an ssd know previously written data does not have to
be preserved, which are sending explicit signals using discard or
fstrim, or by simply overwriting the space with new data. The worst
case behaviour is the btrfs ssd_spread mount option in combination with
not having discard enabled. That combination has the side effect of
minimizing the reuse of free space that was previously written to.
Fourth, the rotational flag in /sys/ does not reliably indicate whether the
device is a locally attached ssd. For example, iSCSI or NBD devices show up as
non-rotational, while a loop device on an ssd shows up as rotational.
The combination of the second and third problem effectively means that
despite all the good intentions, the btrfs ssd mode reliably causes the
ssd hardware and the filesystem structures and performance to be choked
to death. The clickbait version of the title of this story would have
been "Btrfs ssd optimizations considered harmful for ssds".
The current nossd 'tetris' mode (even still without discard) allows a
pattern of overwriting much more previously used space, causing many
more implicit discards to happen because of the overwrite information
the ssd gets. The actual location in the physical address space, as seen
from the point of view of btrfs is irrelevant, because the actual writes
to the low level flash are reordered anyway thanks to the FTL.
Changes made in the code
1. Make ssd mode data allocation identical to tetris mode, like nossd.
2. Adjust and clean up filesystem mount messages so that we can easily
identify if a kernel has this patch applied or not, when providing
support to end users. Also, make better use of the *_and_info helpers to
only trigger messages on actual state changes.
Backporting notes
Notes for whoever wants to backport this patch to their 4.9 LTS kernel:
* First apply commit 951e7966 "btrfs: drop the nossd flag when
remounting with -o ssd", or fixup the differences manually.
* The rest of the conflicts are because of the fs_info refactoring. So,
for example, instead of using fs_info, it's root->fs_info in
extent-tree.c
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-28 09:31:28 +03:00
* allowing us to do less seeky writes . They are used for all metadata
* allocations . In ssd_spread mode they are also used for data allocations .
2009-04-03 17:47:43 +04:00
*/
struct btrfs_free_cluster {
spinlock_t lock ;
spinlock_t refill_lock ;
struct rb_root root ;
/* largest extent in this cluster */
u64 max_size ;
/* first extent starting offset */
u64 window_start ;
2015-10-02 22:25:10 +03:00
/* We did a full search and couldn't create a cluster */
bool fragmented ;
2009-04-03 17:47:43 +04:00
struct btrfs_block_group_cache * block_group ;
/*
* when a cluster is allocated from a block group , we put the
* cluster onto a list in the block group so that it can
* be freed before the block group is freed .
*/
struct list_head block_group_list ;
2008-03-24 22:01:59 +03:00
} ;
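The clustering idea discussed above can be sketched in userspace. This is a minimal model, not the kernel's implementation: `struct free_extent`, `struct cluster_model`, `cluster_fill()` and `first_fit()` are hypothetical names, and the real code keeps extents in an rb-tree and applies further window constraints. It only illustrates why small freed holes are skipped by a cluster but reused by a first-fit ("tetris") allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one free extent inside a block group. */
struct free_extent {
	uint64_t start;
	uint64_t len;
};

/* Hypothetical, simplified analogue of the window fields of a free cluster. */
struct cluster_model {
	uint64_t window_start;	/* offset of the first extent in the cluster */
	uint64_t max_size;	/* largest extent in the cluster */
	size_t nr_extents;
};

/*
 * Fill a cluster from extents sorted by offset, keeping only extents of at
 * least min_len bytes. Returns 0 on success, -1 if no extent qualified.
 */
int cluster_fill(struct cluster_model *c, const struct free_extent *ext,
		 size_t n, uint64_t min_len)
{
	size_t i;

	assert(c != NULL);
	c->window_start = 0;
	c->max_size = 0;
	c->nr_extents = 0;

	for (i = 0; i < n; i++) {
		if (ext[i].len < min_len)
			continue;	/* small hole: the cluster never picks it up */
		if (c->nr_extents == 0)
			c->window_start = ext[i].start;
		if (ext[i].len > c->max_size)
			c->max_size = ext[i].len;
		c->nr_extents++;
	}
	return c->nr_extents ? 0 : -1;
}

/*
 * First fit: index of the lowest-offset extent that can hold the request,
 * or -1 if none can.
 */
long first_fit(const struct free_extent *ext, size_t n, uint64_t bytes)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (ext[i].len >= bytes)
			return (long)i;
	return -1;
}
```

With a 1 MiB minimum, the 4 KiB and 8 KiB holes in a mixed extent list stay outside the cluster, while `first_fit()` hands out the 4 KiB hole at the lowest offset for a 4 KiB request.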
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick off the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block group's caching waitqueue; the
caching kthread wakes the waiting threads up every time it finds 2 meg
worth of space, and then again when it's finished caching. This is how I tested
the speedup from this:
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change that's been put in place is we lock the super mirrors in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until it's
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 05:29:25 +04:00
enum btrfs_caching_type {
BTRFS_CACHE_NO = 0 ,
BTRFS_CACHE_STARTED = 1 ,
2011-11-14 22:52:14 +04:00
BTRFS_CACHE_FAST = 2 ,
BTRFS_CACHE_FINISHED = 3 ,
2013-08-05 19:15:21 +04:00
BTRFS_CACHE_ERROR = 4 ,
2009-07-14 05:29:25 +04:00
} ;
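The "used == size" shortcut described in the async caching commit message can be sketched as follows. This is a minimal sketch under assumptions: the enum values mirror the ones above, but `enum cache_state`, `initial_cache_state()` and `cache_done()` are names local to this example, and the real code handles more cases when reading block groups.

```c
#include <assert.h>
#include <stdint.h>

/* Values mirror enum btrfs_caching_type; the names are local to this sketch. */
enum cache_state {
	CACHE_NO = 0,
	CACHE_STARTED = 1,
	CACHE_FAST = 2,
	CACHE_FINISHED = 3,
	CACHE_ERROR = 4,
};

/*
 * Hypothetical helper: pick the initial state for a block group read from
 * disk. A completely full block group has no free space to discover, so it
 * can be treated as cached right away.
 */
enum cache_state initial_cache_state(uint64_t bg_size, uint64_t bytes_used)
{
	assert(bytes_used <= bg_size);
	if (bytes_used == bg_size)
		return CACHE_FINISHED;
	return CACHE_NO;
}

/* Hypothetical predicate: caching is over, whether it succeeded or failed. */
int cache_done(enum cache_state s)
{
	return s == CACHE_FINISHED || s == CACHE_ERROR;
}
```

Treating the error state as "done" keeps waiters from sleeping forever on a block group whose caching failed.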
2010-06-21 22:48:16 +04:00
enum btrfs_disk_cache_state {
BTRFS_DC_WRITTEN = 0 ,
BTRFS_DC_ERROR = 1 ,
BTRFS_DC_CLEAR = 2 ,
BTRFS_DC_SETUP = 3 ,
} ;
2009-09-12 00:11:19 +04:00
struct btrfs_caching_control {
struct list_head list ;
struct mutex mutex ;
wait_queue_head_t wait ;
2011-06-30 22:42:28 +04:00
struct btrfs_work work ;
2009-09-12 00:11:19 +04:00
struct btrfs_block_group_cache * block_group ;
u64 progress ;
2017-03-03 11:55:14 +03:00
refcount_t count ;
2009-09-12 00:11:19 +04:00
} ;
2015-09-30 06:50:33 +03:00
/* Once caching_thread() finds this much free space, it will wake up waiters. */
2017-10-16 16:48:40 +03:00
# define CACHING_CTL_WAKE_UP SZ_2M
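A rough model of the wake-up threshold: the caching thread accumulates the free space it finds and wakes waiters once the running total passes the threshold, then starts counting again. `WAKE_UP_BYTES` and `count_wakeups()` are hypothetical names for this sketch, and the strictly-greater comparison is an assumption, not something the header states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stands in for CACHING_CTL_WAKE_UP (SZ_2M) in this userspace sketch. */
#define WAKE_UP_BYTES (2ULL * 1024 * 1024)

/*
 * Given the sizes of free ranges discovered in order, count how many times
 * waiters would be woken because the running total passed the threshold.
 * The final wake-up at the end of caching is not counted here.
 */
unsigned int count_wakeups(const uint64_t *found, size_t n)
{
	uint64_t total = 0;
	unsigned int wakeups = 0;
	size_t i;

	assert(found != NULL || n == 0);
	for (i = 0; i < n; i++) {
		total += found[i];
		if (total > WAKE_UP_BYTES) {	/* assumption: strictly greater */
			total = 0;		/* start counting afresh */
			wakeups++;
		}
	}
	return wakeups;
}
```

Batching wake-ups this way lets allocators retry as soon as a useful amount of space is known, without waking them for every small range.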
2015-09-30 06:50:33 +03:00
2015-04-06 23:17:20 +03:00
struct btrfs_io_ctl {
void * cur , * orig ;
struct page * page ;
struct page * * pages ;
2016-06-23 01:56:18 +03:00
struct btrfs_fs_info * fs_info ;
2015-04-05 03:14:42 +03:00
struct inode * inode ;
2015-04-06 23:17:20 +03:00
unsigned long size ;
int index ;
int num_pages ;
2015-04-05 03:14:42 +03:00
int entries ;
int bitmaps ;
2015-04-06 23:17:20 +03:00
unsigned check_crcs : 1 ;
} ;
2017-04-14 03:35:54 +03:00
/*
* Tree to record all locked full stripes of a RAID5 / 6 block group
*/
struct btrfs_full_stripe_locks_tree {
struct rb_root root ;
struct mutex lock ;
} ;
2007-04-27 00:46:15 +04:00
struct btrfs_block_group_cache {
struct btrfs_key key ;
struct btrfs_block_group_item item ;
2009-07-14 05:29:25 +04:00
struct btrfs_fs_info * fs_info ;
2010-06-21 22:48:16 +04:00
struct inode * inode ;
2008-07-23 07:06:41 +04:00
spinlock_t lock ;
2007-11-16 22:57:08 +03:00
u64 pinned ;
2008-09-26 18:05:48 +04:00
u64 reserved ;
Btrfs: fix broken free space cache after the system crashed
When we mounted the filesystem after the crash, we got the following
message:
BTRFS error (device xxx): block group xxxx has wrong amount of free space
BTRFS error (device xxx): failed to load free space cache for block group xxx
It is because we didn't update the metadata of the allocated space (in the extent
tree) until the file data was written to the disk. During this time, there was
no information about the allocated space in either the extent tree or the
free space cache. When we wrote out the free space cache at this time (commit
transaction), that space was lost. In fact, only the free space that is
used to store the file data had this problem; the rest didn't because
its metadata is updated in the same transaction context.
There are several methods which can fix the above problem:
- track the allocated space, and write it out when we write out the free
space cache
- account the size of the allocated space that is used to store the file
data; if the size is not zero, don't write out the free space cache.
The first one is complex and may hurt performance.
This patch chose the second method: we use a per-block-group variable to
account the size of that allocated space. Besides that, we also introduce
a per-block-group read-write semaphore to avoid the race between
the allocation and the free space cache write out.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 06:42:50 +04:00
u64 delalloc_bytes ;
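The second method from the commit message above (account the bytes, skip the cache write-out while the count is non-zero) can be modelled like this. It is a single-threaded sketch: `struct bg_model` and the three helpers are hypothetical, and the read-write semaphore that closes the race in the real code is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the per-block-group accounting. */
struct bg_model {
	/* Bytes allocated for file data whose extent tree update is pending. */
	uint64_t delalloc_bytes;
};

/* Space was handed out for file data; metadata is not updated yet. */
void bg_alloc_data(struct bg_model *bg, uint64_t bytes)
{
	bg->delalloc_bytes += bytes;
}

/* The extent tree now records the allocation, so stop accounting it here. */
void bg_data_metadata_updated(struct bg_model *bg, uint64_t bytes)
{
	assert(bg->delalloc_bytes >= bytes);
	bg->delalloc_bytes -= bytes;
}

/* The free space cache is only safe to write when nothing is in flight. */
int bg_can_write_free_space_cache(const struct bg_model *bg)
{
	return bg->delalloc_bytes == 0;
}
```

Skipping the write-out is cheap because a block group without a valid on-disk cache is simply cached again from the extent tree on the next mount.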
2009-09-12 00:11:20 +04:00
u64 bytes_super ;
2008-03-24 22:01:56 +03:00
u64 flags ;
2011-10-06 16:58:24 +04:00
u64 cache_generation ;
2015-09-30 06:50:35 +03:00
/*
* If the free space extent count exceeds this number , convert the block
* group to bitmaps .
*/
u32 bitmap_high_thresh ;
/*
* If the free space extent count drops below this number , convert the
* block group back to extents .
*/
u32 bitmap_low_thresh ;
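The two thresholds form a hysteresis band: conversion to bitmaps happens above the high mark and conversion back to extents only below the low mark, so a count hovering between them does not flip the representation back and forth. A minimal sketch, with `struct fs_tree_model` and `update_representation()` as hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the representation state of one block group. */
struct fs_tree_model {
	uint32_t bitmap_high_thresh;
	uint32_t bitmap_low_thresh;
	int using_bitmaps;
};

/*
 * Re-evaluate the representation for the current free space extent count.
 * Returns 1 if a conversion happened, 0 otherwise.
 */
int update_representation(struct fs_tree_model *m, uint32_t extent_count)
{
	assert(m->bitmap_low_thresh <= m->bitmap_high_thresh);

	if (!m->using_bitmaps && extent_count > m->bitmap_high_thresh) {
		m->using_bitmaps = 1;	/* too many extents: go to bitmaps */
		return 1;
	}
	if (m->using_bitmaps && extent_count < m->bitmap_low_thresh) {
		m->using_bitmaps = 0;	/* few extents again: back to extents */
		return 1;
	}
	return 0;
}
```

Between the two thresholds the current representation is kept, whichever one it is.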
2013-01-30 03:40:14 +04:00
2014-06-19 06:42:50 +04:00
/*
* Used only for delayed data space allocation, because only data
* space allocation and the related metadata update can span
* transactions.
*/
struct rw_semaphore data_rwsem ;
2013-01-30 03:40:14 +04:00
/* for raid56, this is a full stripe, without parity */
unsigned long full_stripe_len ;
2015-08-05 11:43:27 +03:00
unsigned int ro ;
2010-11-20 15:03:07 +03:00
unsigned int iref : 1 ;
2014-11-26 18:28:51 +03:00
unsigned int has_caching_ctl : 1 ;
Btrfs: fix race between fs trimming and block group remove/allocation
Our fs trim operation, which is completely transactionless (it doesn't start
or join an existing transaction), consists of visiting all block groups
and, for each one, iterating its free space entries and performing a
discard operation against the space range represented by each free space
entry. However, before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.
If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, and we remove the extent map that maps
logical addresses to physical addresses and the corresponding chunk metadata
from the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.
So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.
If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:
checking extents
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
owner ref check failed [833912832 16384]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
root 5 root dir 256 error
root 5 inode 260 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 262 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 263 errors 2001, no inode item, link count wrong
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-28 00:14:15 +03:00
unsigned int removed : 1 ;
2010-06-21 22:48:16 +04:00
int disk_cache_state ;
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replace it with an rb-tree that tracks the block group cache via offset. Also
add a per space_info list that tracks the block group cache for the particular
space so we can look up related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. There were some issues in volumes.c where we wouldn't allocate
the rest of the disk. Fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so every time we did a data allocation we'd search
through every block group that we have looking for free space. While searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degradation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 21:14:11 +04:00
2009-07-14 05:29:25 +04:00
/* cache tracking stuff */
int cached ;
2009-09-12 00:11:19 +04:00
struct btrfs_caching_control * caching_ctl ;
u64 last_byte_to_unpin ;
2009-07-14 05:29:25 +04:00
2008-09-23 21:14:11 +04:00
struct btrfs_space_info * space_info ;
/* free space cache stuff */
2011-03-29 09:46:06 +04:00
struct btrfs_free_space_ctl * free_space_ctl ;
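The hint-then-size search from the "free space accounting redo" commit message can be sketched with linear scans over a sorted array. This is a simplification under assumptions: the commit describes two rb-trees (by offset and by size), and `struct free_range` and `find_free_space()` are hypothetical names. Ranges that start before the hint but extend past it are ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one tracked range of free space. */
struct free_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Find space for a request of the given size. The array must be sorted by
 * start offset. Returns the index of the chosen range, or -1 if none fits.
 */
long find_free_space(const struct free_range *r, size_t n,
		     uint64_t hint, uint64_t bytes)
{
	long best = -1;
	size_t i;

	assert(r != NULL || n == 0);

	/* Pass 1 (offset order): first large-enough range at or after the hint. */
	for (i = 0; i < n; i++)
		if (r[i].start >= hint && r[i].len >= bytes)
			return (long)i;

	/*
	 * Pass 2 (size order): tightest fit anywhere, to avoid breaking small
	 * pieces off of huge free areas.
	 */
	for (i = 0; i < n; i++) {
		if (r[i].len < bytes)
			continue;
		if (best < 0 || r[i].len < r[best].len)
			best = (long)i;
	}
	return best;
}
```

The first pass gives good packing near the hint; the second trades locality for keeping large free areas intact.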
2008-09-23 21:14:11 +04:00
/* block group cache stuff */
struct rb_node cache_node ;
/* for block groups in the same raid type */
struct list_head list ;
2008-12-12 00:30:39 +03:00
/* usage count */
atomic_t count ;
2009-04-03 17:47:43 +04:00
/*
* List of struct btrfs_free_cluster for this block group.
* Today it will only have one entry on it, but that may change.
*/
struct list_head cluster_list ;
2012-09-12 00:57:25 +04:00
2014-09-18 19:20:02 +04:00
/* For delayed block group creation or deletion of empty block groups */
struct list_head bg_list ;
2014-10-31 16:49:34 +03:00
/* For read-only block groups */
struct list_head ro_list ;
2014-11-28 00:14:15 +03:00
atomic_t trimming ;
2014-11-17 23:45:48 +03:00
/* For dirty block groups */
struct list_head dirty_list ;
2015-04-05 03:14:42 +03:00
struct list_head io_list ;
struct btrfs_io_ctl io_ctl ;
2015-09-30 06:50:35 +03:00
Btrfs: don't do unnecessary delalloc flushes when relocating
Before we start the actual relocation process of a block group, we do
calls to flush delalloc of all inodes and then wait for ordered extents
to complete. However we do these flush calls just to make sure we don't
race with concurrent tasks that have actually already started to run
delalloc and have allocated an extent from the block group we want to
relocate, right before we set it to readonly mode, but have not yet
created the respective ordered extents. The flush calls make us wait
for such concurrent tasks because they end up calling
filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() ->
__start_delalloc_inodes() -> btrfs_alloc_delalloc_work() ->
btrfs_run_delalloc_work()) which ends up serializing us with those tasks
due to attempts to lock the same pages (and the delalloc flush procedure
calls the allocator and creates the ordered extents before unlocking the
pages).
These flushing calls not only make us waste time (cpu, IO) but also reduce
the chances of writing larger extents (applications might be writing to
contiguous ranges and we flush before they finish dirtying the whole
ranges).
So make sure we don't flush delalloc and just wait for concurrent tasks
that have already started flushing delalloc and have allocated an extent
from the block group we are about to relocate.
This change also ends up fixing a race with direct IO writes that makes
relocation not wait for direct IO ordered extents. This race is
illustrated by the following diagram:
CPU 1 CPU 2
btrfs_relocate_block_group(bg X)
starts direct IO write,
target inode currently has no
ordered extents ongoing nor
dirty pages (delalloc regions),
therefore the root for our inode
is not in the list
fs_info->ordered_roots
btrfs_direct_IO()
__blockdev_direct_IO()
btrfs_get_blocks_direct()
btrfs_lock_extent_direct()
locks range in the io tree
btrfs_new_extent_direct()
btrfs_reserve_extent()
--> extent allocated
from bg X
btrfs_inc_block_group_ro(bg X)
btrfs_start_delalloc_roots()
__start_delalloc_inodes()
--> does nothing, no delalloc ranges
in the inode's io tree so the
inode's root is not in the list
fs_info->delalloc_roots
btrfs_wait_ordered_roots()
--> does not find the inode's root in the
list fs_info->ordered_roots
--> ends up not waiting for the direct IO
write started by the task at CPU 2
relocate_block_group(rc->stage ==
MOVE_DATA_EXTENTS)
prepare_to_relocate()
btrfs_commit_transaction()
iterates the extent tree, using its
commit root and moves extents into new
locations
btrfs_add_ordered_extent_dio()
--> now an ordered extent is
created and added to the
list root->ordered_extents
and the root added to the
list fs_info->ordered_roots
--> this is too late and the
task at CPU 1 already
started the relocation
btrfs_commit_transaction()
btrfs_finish_ordered_io()
btrfs_alloc_reserved_file_extent()
--> adds delayed data reference
for the extent allocated
from bg X
relocate_block_group(rc->stage ==
UPDATE_DATA_PTRS)
prepare_to_relocate()
btrfs_commit_transaction()
--> delayed refs are run, so an extent
item for the allocated extent from
bg X is added to extent tree
--> commit roots are switched, so the
next scan in the extent tree will
see the extent item
sees the extent in the extent tree
When this happens the relocation produces the following warning when it
finishes:
[ 7260.832836] ------------[ cut here ]------------
[ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]()
[ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
[ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1
[ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7260.852998] 0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000
[ 7260.852998] 0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d
[ 7260.852998] ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000
[ 7260.852998] Call Trace:
[ 7260.852998] [<ffffffff812648b3>] dump_stack+0x67/0x90
[ 7260.852998] [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2
[ 7260.852998] [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
[ 7260.852998] [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c
[ 7260.852998] [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
[ 7260.852998] [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs]
[ 7260.852998] [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs]
[ 7260.852998] [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19
[ 7260.852998] [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs]
[ 7260.852998] [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs]
[ 7260.852998] [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63
[ 7260.852998] [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44
[ 7260.852998] [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc
[ 7260.852998] [<ffffffff811876ab>] vfs_ioctl+0x18/0x34
[ 7260.852998] [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be
[ 7260.852998] [<ffffffff81190c30>] ? __fget_light+0x4d/0x71
[ 7260.852998] [<ffffffff81187d77>] SyS_ioctl+0x57/0x79
[ 7260.852998] [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7260.893268] ---[ end trace eb7803b24ebab8ad ]---
This is because at the end of the first stage, in relocate_block_group(),
we commit the current transaction, which makes delayed refs run, the
commit roots are switched and so the second stage will find the extent
item that the ordered extent added to the delayed refs. But this extent
was not moved (ordered extent completed after first stage finished), so
at the end of the relocation our block group item still has a positive
used bytes counter, triggering a warning at the end of
btrfs_relocate_block_group(). Later on when trying to read the extent
contents from disk we hit a BUG_ON() due to the inability to map a block
with a logical address that belongs to the block group we relocated and
is no longer valid, resulting in the following trace:
[ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096
[ 7344.887518] ------------[ cut here ]------------
[ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833!
[ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
[ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G W 4.5.0-rc6-btrfs-next-28+ #1
[ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000
[ 7344.888431] RIP: 0010:[<ffffffffa037c88c>] [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
[ 7344.888431] RSP: 0018:ffff8802046878f0 EFLAGS: 00010282
[ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001
[ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff
[ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000
[ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000
[ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0
[ 7344.888431] FS: 00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
[ 7344.888431] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0
[ 7344.888431] Stack:
[ 7344.888431] 0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950
[ 7344.888431] ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000
[ 7344.888431] ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000
[ 7344.888431] Call Trace:
[ 7344.888431] [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs]
[ 7344.888431] [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs]
[ 7344.888431] [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs]
[ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431] [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf
[ 7344.888431] [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs]
[ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431] [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs]
[ 7344.888431] [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10
[ 7344.888431] [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs]
[ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431] [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd
[ 7344.888431] [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs]
[ 7344.888431] [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc
[ 7344.888431] [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207
[ 7344.888431] [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207
[ 7344.888431] [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154
[ 7344.888431] [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f
[ 7344.888431] [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1
[ 7344.888431] [<ffffffff8117773a>] __vfs_read+0x79/0x9d
[ 7344.888431] [<ffffffff81178050>] vfs_read+0x8f/0xd2
[ 7344.888431] [<ffffffff81178a38>] SyS_read+0x50/0x7e
[ 7344.888431] [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0
[ 7344.888431] RIP [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
[ 7344.888431] RSP <ffff8802046878f0>
[ 7344.970544] ---[ end trace eb7803b24ebab8ae ]---
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2016-04-26 17:39:32 +03:00
/*
* Incremented when doing extent allocations and holding a read lock
* on the space_info's groups_sem semaphore.
* Decremented when an ordered extent that represents an IO against this
* block group's range is created (after it's added to its inode's
* root's list of ordered extents) or immediately after the allocation
* if it's a metadata extent or fallocate extent (for these cases we
* don't create ordered extents).
*/
atomic_t reservations ;
2016-05-09 15:15:41 +03:00
/*
* Incremented while holding the spinlock *lock* by a task checking if
* it can perform a nocow write (incremented if the value for the *ro*
* field is 0). Decremented by such tasks once they create an ordered
* extent or before that if some error happens before reaching that step.
* This is to prevent races between block group relocation and nocow
* writes through direct IO.
*/
atomic_t nocow_writers ;
2015-09-30 06:50:35 +03:00
/* Lock for free space tree operations. */
struct mutex free_space_lock ;
/*
* Does the block group need to be added to the free space tree?
* Protected by free_space_lock.
*/
int needs_free_space ;
2017-04-14 03:35:54 +03:00
/* Record locked full stripes for RAID5/6 block group */
struct btrfs_full_stripe_locks_tree full_stripe_locks_root ;
2007-04-27 00:46:15 +04:00
} ;
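The nocow_writers comment above describes a check-then-count protocol: a writer takes a reference only while the group is still writable, and relocation flips the group to read-only before looking at the counter. The following is a minimal userspace sketch of that idea, not the kernel implementation. The struct, the helper names and the atomic_flag spinlock are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the block group fields discussed above. */
struct bg_sketch {
	atomic_flag lock;          /* stands in for the spinlock "lock" */
	int ro;                    /* non-zero once the group is read-only */
	atomic_int nocow_writers;  /* tasks between the ro check and their ordered extent */
};

static void bg_sketch_lock(struct bg_sketch *bg)
{
	while (atomic_flag_test_and_set(&bg->lock))
		; /* spin until the flag was previously clear */
}

static void bg_sketch_unlock(struct bg_sketch *bg)
{
	atomic_flag_clear(&bg->lock);
}

static void bg_sketch_init(struct bg_sketch *bg)
{
	atomic_flag_clear(&bg->lock);
	bg->ro = 0;
	atomic_init(&bg->nocow_writers, 0);
}

/* Take a writer reference only if the group is still writable. */
static bool bg_sketch_inc_nocow_writers(struct bg_sketch *bg)
{
	bool ok = false;

	bg_sketch_lock(bg);
	if (bg->ro == 0) {
		atomic_fetch_add(&bg->nocow_writers, 1);
		ok = true;
	}
	bg_sketch_unlock(bg);
	return ok;
}

/* Called once the ordered extent exists, or on an earlier error path. */
static void bg_sketch_dec_nocow_writers(struct bg_sketch *bg)
{
	int old = atomic_fetch_sub(&bg->nocow_writers, 1);

	assert(old > 0); /* every decrement must pair with an increment */
}

/* Relocation side: mark read-only first, then report whether writers remain. */
static bool bg_sketch_set_ro_and_check_idle(struct bg_sketch *bg)
{
	bg_sketch_lock(bg);
	bg->ro = 1;
	bg_sketch_unlock(bg);
	return atomic_load(&bg->nocow_writers) == 0;
}
```

Because the ro check and the increment happen under the same lock that relocation takes to set ro, a writer either sees ro != 0 and backs off, or its reference is visible by the time relocation reads the counter.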
2008-03-24 22:01:56 +03:00
2012-06-21 13:08:04 +04:00
/* delayed seq elem */
struct seq_list {
struct list_head list ;
u64 seq ;
} ;
2015-02-25 17:47:32 +03:00
# define SEQ_LIST_INIT(name) { .list = LIST_HEAD_INIT((name).list), .seq = 0 }
2017-03-16 19:04:34 +03:00
# define SEQ_LAST ((u64)-1)
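As a hedged illustration of the two macros above: SEQ_LIST_INIT produces an element whose list head points at itself (that is, it is not linked into any list) with a sequence number of 0, and SEQ_LAST is the largest u64 value. The list_head and LIST_HEAD_INIT definitions below are userspace stand-ins for the kernel's, and seq_list_is_unlinked() is a hypothetical helper added only for the example.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Userspace stand-ins for the kernel's list primitives (assumption). */
struct list_head {
	struct list_head *next, *prev;
};
#define LIST_HEAD_INIT(name) { &(name), &(name) }

/* delayed seq elem */
struct seq_list {
	struct list_head list;
	u64 seq;
};

#define SEQ_LIST_INIT(name) { .list = LIST_HEAD_INIT((name).list), .seq = 0 }
#define SEQ_LAST ((u64)-1)

/* Hypothetical helper: an unlinked element points at itself in both directions. */
static int seq_list_is_unlinked(const struct seq_list *elem)
{
	return elem->list.next == &elem->list && elem->list.prev == &elem->list;
}
```

A caller would typically write `struct seq_list elem = SEQ_LIST_INIT(elem);` so that the element is safe to test or add to a list without a separate init call.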
2013-02-08 01:06:02 +04:00
enum btrfs_orphan_cleanup_state {
ORPHAN_CLEANUP_STARTED = 1 ,
ORPHAN_CLEANUP_DONE = 2 ,
} ;
2013-01-30 03:40:14 +04:00
/* used by the raid56 code to lock stripes for read/modify/write */
struct btrfs_stripe_hash {
struct list_head hash_list ;
spinlock_t lock ;
} ;
/* used by the raid56 code to lock stripes for read/modify/write */
struct btrfs_stripe_hash_table {
2013-01-31 23:42:09 +04:00
struct list_head stripe_cache ;
spinlock_t cache_lock ;
int cache_size ;
struct btrfs_stripe_hash table [ ] ;
2013-01-30 03:40:14 +04:00
} ;
# define BTRFS_STRIPE_HASH_TABLE_BITS 11
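The table[] member above is a flexible array, so the number of buckets is decided when the table is allocated. A rough sketch follows, assuming the bucket count is 1 << BTRFS_STRIPE_HASH_TABLE_BITS. The simplified structs, the helper names and the multiplicative hash are illustrative assumptions, not the raid56 code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BTRFS_STRIPE_HASH_TABLE_BITS 11

/* Simplified stand-in: the real bucket carries a list head and a spinlock. */
struct stripe_hash_sketch {
	int nr_entries;
};

/* Simplified stand-in for the table header plus its flexible array member. */
struct stripe_hash_table_sketch {
	int cache_size;
	struct stripe_hash_sketch table[]; /* sized at allocation time */
};

/* Assumed bucket count: one bucket per possible BITS-wide hash value. */
static size_t stripe_table_buckets(void)
{
	return (size_t)1 << BTRFS_STRIPE_HASH_TABLE_BITS;
}

/* Allocate header and buckets in one zeroed block; returns NULL on failure. */
static struct stripe_hash_table_sketch *stripe_table_alloc(void)
{
	size_t size = sizeof(struct stripe_hash_table_sketch) +
		      stripe_table_buckets() * sizeof(struct stripe_hash_sketch);

	return calloc(1, size);
}

/* Hypothetical bucket choice: multiplicative hash keeping the top BITS bits. */
static size_t stripe_bucket(uint64_t stripe_start)
{
	return (size_t)((stripe_start * UINT64_C(0x9E3779B97F4A7C15)) >>
			(64 - BTRFS_STRIPE_HASH_TABLE_BITS));
}
```

Keeping a lock per bucket, as struct btrfs_stripe_hash does, means two stripes only contend when they hash to the same bucket.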
Btrfs: reclaim the reserved metadata space at background
Before applying this patch, a task had to reclaim metadata space by
itself if there was not enough of it. And when the task started the
space reclamation, all the other tasks which wanted to reserve
metadata space were blocked. In some cases they would be blocked for
a long time, which made performance fluctuate wildly.
So we introduce background metadata space reclamation: when the space
is about to be exhausted, we insert a reclaim work into the workqueue, and
the worker of the workqueue reclaims the reserved space in the
background. This way, tasks needn't reclaim the space by themselves in
most cases, and even if they have to reclaim the space or are blocked
waiting for space reclamation, they will get enough space more quickly.
Here is my test result(Tested by compilebench):
Memory: 2GB
CPU: 2Cores * 1CPU
Partition: 40GB(SSD)
Test command:
# compilebench -D <mnt> -m
Without this patch:
intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
With this patch:
intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-05-14 04:29:04 +04:00
void btrfs_init_async_reclaim_work ( struct work_struct * work ) ;
2012-06-21 13:08:04 +04:00
/* fs_info */
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 18:45:14 +04:00
struct reloc_control ;
2008-03-24 22:01:56 +03:00
struct btrfs_device ;
2008-03-24 22:02:07 +03:00
struct btrfs_fs_devices ;
2012-01-17 00:04:47 +04:00
struct btrfs_balance_control ;
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comments.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as the inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so we made this patch, which implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which is waiting to be dealt with
by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the number of delayed items exceeds the lower limit, we create works for some
delayed nodes and insert them into the work queue of the worker, and then
go back.
When the number of delayed items exceeds the upper bound, we create works for all
the delayed nodes that haven't been dealt with, insert them into the work
queue of the worker, and then wait until the number of untreated items drops
below some threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
in the inserting rb-tree first. If we find it there, just drop it. If not,
add the key of it into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same to inserting manipulation)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node, By this way, we can cache more delayed items and merge more
inode updates.
- If we want to commit a transaction, we will deal with all the delayed nodes.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 14:12:22 +04:00
struct btrfs_delayed_root ;
2016-09-02 22:40:02 +03:00
# define BTRFS_FS_BARRIER 1
# define BTRFS_FS_CLOSING_START 2
# define BTRFS_FS_CLOSING_DONE 3
# define BTRFS_FS_LOG_RECOVERING 4
# define BTRFS_FS_OPEN 5
# define BTRFS_FS_QUOTA_ENABLED 6
# define BTRFS_FS_QUOTA_ENABLING 7
# define BTRFS_FS_UPDATE_UUID_TREE_GEN 9
# define BTRFS_FS_CREATING_FREE_SPACE_TREE 10
# define BTRFS_FS_BTREE_ERR 11
# define BTRFS_FS_LOG1_ERR 12
# define BTRFS_FS_LOG2_ERR 13
2017-05-12 00:17:33 +03:00
# define BTRFS_FS_QUOTA_OVERRIDE 14
2017-06-15 20:10:03 +03:00
/* Used to record internally whether fs has been frozen */
# define BTRFS_FS_FROZEN 15
2017-05-12 00:17:33 +03:00
2017-03-28 15:44:21 +03:00
/*
* Indicate that a whole-filesystem exclusive operation is running
* (device replace, resize, device add/delete, balance)
*/
2017-10-04 05:05:17 +03:00
# define BTRFS_FS_EXCL_OP 16
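The BTRFS_FS_* values above are bit numbers, not masks. They are meant to index into the unsigned long flags word of struct btrfs_fs_info. The sketch below shows how such numbers are typically used, with plain non-atomic userspace helpers standing in for the kernel's atomic bitops. The sketch_* names and struct fs_info_sketch are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define BTRFS_FS_OPEN    5
#define BTRFS_FS_FROZEN  15
#define BTRFS_FS_EXCL_OP 16

/* Hypothetical stand-in holding only the flags word. */
struct fs_info_sketch {
	unsigned long flags;
};

/* Non-atomic stand-ins for bit helpers; the kernel versions are atomic. */
static void sketch_set_bit(int nr, unsigned long *addr)
{
	*addr |= 1UL << nr;
}

static void sketch_clear_bit(int nr, unsigned long *addr)
{
	*addr &= ~(1UL << nr);
}

static bool sketch_test_bit(int nr, const unsigned long *addr)
{
	return (*addr >> nr) & 1UL;
}

/* Sets the bit and returns its previous value. */
static bool sketch_test_and_set_bit(int nr, unsigned long *addr)
{
	bool was_set = sketch_test_bit(nr, addr);

	sketch_set_bit(nr, addr);
	return was_set;
}
```

With this shape, an exclusive operation can claim BTRFS_FS_EXCL_OP with a single test-and-set: a return of true means another exclusive operation already holds the bit. This only works as a mutual-exclusion gate when the test-and-set is atomic, which the stand-in here is not.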
2016-09-02 22:40:02 +03:00
2007-03-20 21:38:32 +03:00
struct btrfs_fs_info {
2007-10-16 00:14:19 +04:00
u8 fsid [ BTRFS_FSID_SIZE ] ;
2008-04-15 23:41:47 +04:00
u8 chunk_tree_uuid [ BTRFS_UUID_SIZE ] ;
2016-09-02 22:40:02 +03:00
unsigned long flags ;
2007-03-15 19:56:47 +03:00
struct btrfs_root * extent_root ;
struct btrfs_root * tree_root ;
2008-03-24 22:01:56 +03:00
struct btrfs_root * chunk_root ;
struct btrfs_root * dev_root ;
2008-11-18 05:02:50 +03:00
struct btrfs_root * fs_root ;
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-09 00:58:54 +03:00
struct btrfs_root * csum_root ;
2011-09-13 14:56:09 +04:00
struct btrfs_root * quota_root ;
2013-08-15 19:11:19 +04:00
struct btrfs_root * uuid_root ;
2015-09-30 06:50:35 +03:00
struct btrfs_root * free_space_root ;
2008-09-06 00:13:11 +04:00
/* the log root tree is a directory of all the other log roots */
struct btrfs_root * log_root_tree ;
2009-09-21 23:56:00 +04:00
spinlock_t fs_roots_radix_lock ;
2007-04-09 18:42:37 +04:00
struct radix_tree_root fs_roots_radix ;
2007-10-16 00:15:26 +04:00
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree that tracks the block group cache via offset. Also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so every time we did a data allocation we'd search
through every block group that we have looking for free space. While searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degradation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 21:14:11 +04:00
/* block group cache stuff */
spinlock_t block_group_cache_lock ;
2012-12-27 13:01:23 +04:00
u64 first_logical_byte ;
2008-09-23 21:14:11 +04:00
struct rb_root block_group_cache_tree ;
2011-09-27 01:12:22 +04:00
/* keep track of unallocated space */
2017-05-11 09:17:46 +03:00
atomic64_t free_chunk_space ;
2011-09-27 01:12:22 +04:00
2009-09-12 00:11:19 +04:00
struct extent_io_tree freed_extents [ 2 ] ;
struct extent_io_tree * pinned_extents ;
2007-10-16 00:15:26 +04:00
2008-03-24 22:01:56 +03:00
/* logical->physical extent mapping */
struct btrfs_mapping_tree mapping_tree ;
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so we made this patch, which implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, which uses two lists to
manage the delayed nodes that are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. The
other is used to manage the delayed nodes which are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees: one is used to manage the directory name
index which is going to be inserted into the b+ tree, and the other is used to
manage the directory name index which is going to be deleted from the b+ tree.
- introduce a worker to deal with the delayed operations. This worker handles
the work of delayed directory name index item insertion and deletion and
the delayed inode update.
When the number of delayed items is beyond the lower limit, we create works
for some delayed nodes, insert them into the work queue of the worker, and
then go back.
When the number of delayed items is beyond the upper bound, we create works
for all the delayed nodes that haven't been dealt with, insert them into the
work queue of the worker, and then wait until the number of untreated items
is below some threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search for
it in the inserting rb-tree first. If we find it, we just drop it. If not, we
add its key into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same as for the inserting manipulation.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We move the delayed node to the tail of the list after we access the
delayed node. In this way, we can cache more delayed items and merge more
inode updates.
- If we want to commit a transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
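The insert/delete bookkeeping and the balance policy from the list above can be sketched as follows. This is only an illustration: small arrays stand in for the two per-node rb-trees, all names (`delayed_node_sketch`, `delayed_delete_index()`, `delayed_balance()`) are hypothetical, and treating "beyond" a limit as strictly greater is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 64	/* arbitrary capacity for this sketch */

/* Hypothetical stand-in for one delayed node and its two rb-trees. */
struct delayed_node_sketch {
	uint64_t ins[MAX_PENDING];	/* dir indexes waiting to be inserted */
	size_t nr_ins;
	uint64_t del[MAX_PENDING];	/* dir indexes waiting to be deleted */
	size_t nr_del;
};

static void delayed_insert_index(struct delayed_node_sketch *node,
				 uint64_t index)
{
	assert(node->nr_ins < MAX_PENDING);
	node->ins[node->nr_ins++] = index;
}

/*
 * Returns true if a pending insertion was simply cancelled, false if the
 * deletion had to be queued for the b+ tree.
 */
static bool delayed_delete_index(struct delayed_node_sketch *node,
				 uint64_t index)
{
	size_t i;

	for (i = 0; i < node->nr_ins; i++) {
		if (node->ins[i] == index) {
			/* Never reached the b+ tree: drop the pending insert. */
			node->ins[i] = node->ins[--node->nr_ins];
			return true;
		}
	}
	assert(node->nr_del < MAX_PENDING);
	node->del[node->nr_del++] = index;
	return false;
}

enum balance_action {
	BALANCE_NONE,		/* below the lower limit: keep caching */
	BALANCE_ASYNC_SOME,	/* queue work for some nodes, do not wait */
	BALANCE_ALL_AND_WAIT,	/* queue work for all nodes, then wait */
};

static enum balance_action delayed_balance(size_t items, size_t lower,
					   size_t upper)
{
	assert(lower <= upper);
	if (items > upper)
		return BALANCE_ALL_AND_WAIT;
	if (items > lower)
		return BALANCE_ASYNC_SOME;
	return BALANCE_NONE;
}
```

Cancelling a pending insertion in memory is what saves work in a create-then-unlink pattern: neither operation ever touches the b+ tree.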
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
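The quoted improvements can be checked against the total times listed above. A minimal helper, with a hypothetical name, that computes the relative reduction in total time:

```c
#include <assert.h>

/* Relative reduction in total time, in percent. */
static double time_reduction_pct(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

With the figures above, `time_reduction_pct(1.096108, 0.932899)` gives about 14.9 for file creation and `time_reduction_pct(1.510403, 1.215732)` gives about 19.5 for file deletion, consistent with the rounded ~15% and ~20%.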
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 14:12:22 +04:00
/*
* block reservation for extent , checksum , root tree and
* delayed dir index item
*/
2010-05-16 18:46:25 +04:00
struct btrfs_block_rsv global_block_rsv ;
/* block reservation for metadata operations */
struct btrfs_block_rsv trans_block_rsv ;
/* block reservation for chunk tree */
struct btrfs_block_rsv chunk_block_rsv ;
2011-11-04 06:54:25 +04:00
/* block reservation for delayed operations */
struct btrfs_block_rsv delayed_block_rsv ;
2010-05-16 18:46:25 +04:00
struct btrfs_block_rsv empty_block_rsv ;
2007-03-20 22:57:25 +03:00
u64 generation ;
2007-08-11 00:22:09 +04:00
u64 last_trans_committed ;
2014-01-23 19:54:11 +04:00
u64 avg_delayed_ref_runtime ;
2009-03-24 17:24:20 +03:00
/*
* this is updated to the current trans every time a full commit
* is required instead of the faster short fsync log commits
*/
u64 last_trans_log_full_commit ;
2012-03-30 15:58:32 +04:00
unsigned long mount_opt ;
2014-02-05 18:26:17 +04:00
/*
* Track requests for actions that need to be done during transaction
* commit ( like for some mount options ) .
*/
unsigned long pending_changes ;
2010-12-17 09:21:50 +03:00
unsigned long compress_type : 4 ;
2017-09-15 18:36:57 +03:00
unsigned int compress_level ;
2013-08-01 20:14:52 +04:00
int commit_interval ;
2013-01-29 14:05:05 +04:00
/*
* It is a suggestive number , the read side is safe even if it gets a
* wrong number because we will write out the data into a regular
* extent . The write side ( mount / remount ) is under - > s_umount lock ,
* so it is also safe .
*/
2008-01-30 00:03:38 +03:00
u64 max_inline ;
2017-06-15 02:30:06 +03:00
2007-03-22 22:59:16 +03:00
struct btrfs_transaction * running_transaction ;
2008-07-17 20:53:50 +04:00
wait_queue_head_t transaction_throttle ;
2008-07-17 20:54:14 +04:00
wait_queue_head_t transaction_wait ;
2010-10-29 23:37:34 +04:00
wait_queue_head_t transaction_blocked_wait ;
2008-11-07 06:02:51 +03:00
wait_queue_head_t async_submit_wait ;
2008-09-06 00:13:11 +04:00
2013-04-11 14:30:16 +04:00
/*
* Used to protect the incompat_flags , compat_flags , compat_ro_flags
* when they are updated .
*
* Because we never clear the flags , we needn ' t use
* the lock on the read side .
*
* We also needn ' t use the lock when we mount the fs , because
* there is no other task which will update the flag .
*/
spinlock_t super_lock ;
2011-04-13 17:41:04 +04:00
struct btrfs_super_block * super_copy ;
struct btrfs_super_block * super_for_commit ;
2007-03-22 19:13:20 +03:00
struct super_block * sb ;
2007-03-28 21:57:48 +04:00
struct inode * btree_inode ;
2008-09-06 00:13:11 +04:00
struct mutex tree_log_mutex ;
2008-06-26 00:01:31 +04:00
struct mutex transaction_kthread_mutex ;
struct mutex cleaner_mutex ;
2008-06-26 00:01:30 +04:00
struct mutex chunk_mutex ;
2008-07-08 22:19:17 +04:00
struct mutex volume_mutex ;
2013-01-30 03:40:14 +04:00
2015-04-06 22:46:08 +03:00
/*
* this is taken to make sure we don ' t set block groups ro after
* the free space cache has been allocated on them
*/
struct mutex ro_block_group_mutex ;
2013-01-30 03:40:14 +04:00
/* this is used during read/modify/write to make sure
* no two ios are trying to mod the same stripe at the same
* time
*/
struct btrfs_stripe_hash_table * stripe_hash_table ;
2009-03-31 21:27:11 +04:00
/*
* this protects the ordered operations list only while we are
* processing all of the entries on it . This way we make
* sure the commit code doesn ' t find the list temporarily empty
* because another function happens to be doing non - waiting preflush
* before jumping into the main commit .
*/
struct mutex ordered_operations_mutex ;
2013-08-14 19:33:56 +04:00
2014-03-13 23:42:13 +04:00
struct rw_semaphore commit_root_sem ;
2009-03-31 21:27:11 +04:00
2009-11-12 12:34:40 +03:00
struct rw_semaphore cleanup_work_sem ;
2009-09-22 00:00:26 +04:00
2009-11-12 12:34:40 +03:00
struct rw_semaphore subvol_sem ;
2009-09-22 00:00:26 +04:00
struct srcu_struct subvol_srcu ;
2011-04-12 01:25:13 +04:00
spinlock_t trans_lock ;
2011-06-14 04:00:16 +04:00
/*
* the reloc mutex goes with the trans lock , it is taken
* during commit to protect us from the relocation code
*/
struct mutex reloc_mutex ;
2007-04-20 05:01:03 +04:00
struct list_head trans_list ;
2007-06-09 02:11:48 +04:00
struct list_head dead_roots ;
2009-09-12 00:11:19 +04:00
struct list_head caching_block_groups ;
2008-09-06 00:13:11 +04:00
2009-11-12 12:36:34 +03:00
spinlock_t delayed_iput_lock ;
struct list_head delayed_iputs ;
Btrfs: fix deadlock running delayed iputs at transaction commit time
While running a stress test I ran into a deadlock when running the delayed
iputs at transaction time, which produced the following report and trace:
[ 886.399989] =============================================
[ 886.400871] [ INFO: possible recursive locking detected ]
[ 886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted
[ 886.402384] ---------------------------------------------
[ 886.403182] fio/8277 is trying to acquire lock:
[ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.403568]
[ 886.403568] but task is already holding lock:
[ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.403568]
[ 886.403568] other info that might help us debug this:
[ 886.403568] Possible unsafe locking scenario:
[ 886.403568]
[ 886.403568] CPU0
[ 886.403568] ----
[ 886.403568] lock(&fs_info->delayed_iput_sem);
[ 886.403568] lock(&fs_info->delayed_iput_sem);
[ 886.403568]
[ 886.403568] *** DEADLOCK ***
[ 886.403568]
[ 886.403568] May be due to missing lock nesting notation
[ 886.403568]
[ 886.403568] 3 locks held by fio/8277:
[ 886.403568] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0
[ 886.403568] #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs]
[ 886.403568] #2: (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.403568]
[ 886.403568] stack backtrace:
[ 886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 886.403568] 0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0
[ 886.403568] ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290
[ 886.403568] 0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804
[ 886.403568] Call Trace:
[ 886.403568] [<ffffffff8125d4fd>] dump_stack+0x4e/0x79
[ 886.403568] [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b
[ 886.403568] [<ffffffff810c22db>] ? __module_address+0xdf/0x108
[ 886.403568] [<ffffffff8108eb77>] lock_acquire+0x10d/0x194
[ 886.403568] [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194
[ 886.403568] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.489542] [<ffffffff8148556b>] down_read+0x3e/0x4d
[ 886.489542] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.489542] [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
[ 886.489542] [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs]
[ 886.489542] [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs]
[ 886.489542] [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs]
[ 886.489542] [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs]
[ 886.489542] [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs]
[ 886.489542] [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs]
[ 886.489542] [<ffffffff81188e31>] evict+0xa7/0x15c
[ 886.489542] [<ffffffff81189878>] iput+0x1d3/0x266
[ 886.489542] [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs]
[ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
[ 886.489542] [<ffffffff81085096>] ? signal_pending_state+0x31/0x31
[ 886.489542] [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs]
[ 886.489542] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
[ 886.489542] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
[ 886.489542] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
[ 886.489542] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
[ 886.489542] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
[ 886.489542] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
[ 886.489542] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
[ 886.489542] [<ffffffff81172cda>] vfs_write+0xa0/0xe4
[ 886.489542] [<ffffffff811734cc>] SyS_write+0x50/0x7e
[ 886.489542] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds.
[ 1081.854348] Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1081.863227] fio D ffff880213f9bb28 0 8244 8240 0x00000000
[ 1081.868719] ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240
[ 1081.872499] ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400
[ 1081.876834] ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4
[ 1081.880782] Call Trace:
[ 1081.881793] [<ffffffff81482ba4>] schedule+0x7f/0x97
[ 1081.883340] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
[ 1081.895525] [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab
[ 1081.897419] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
[ 1081.899251] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
[ 1081.901063] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
[ 1081.902365] [<ffffffff814855bd>] down_write+0x43/0x57
[ 1081.903846] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1081.906078] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1081.908846] [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c
[ 1081.910409] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
[ 1081.912482] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
[ 1081.914597] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
[ 1081.919037] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
[ 1081.920754] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
[ 1081.922496] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
[ 1081.923922] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
[ 1081.925275] [<ffffffff81172cda>] vfs_write+0xa0/0xe4
[ 1081.926584] [<ffffffff811734cc>] SyS_write+0x50/0x7e
[ 1081.927968] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 1081.985293] INFO: lockdep is turned off.
[ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds.
[ 1081.987434] Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1081.990147] fio D ffff880218febbb8 0 8249 8240 0x00000000
[ 1081.991626] ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240
[ 1081.993258] ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00
[ 1081.994850] ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4
[ 1081.996485] Call Trace:
[ 1081.997037] [<ffffffff81482ba4>] schedule+0x7f/0x97
[ 1081.998017] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
[ 1081.999241] [<ffffffff810852a5>] ? finish_wait+0x6d/0x76
[ 1082.000306] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
[ 1082.001533] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
[ 1082.002776] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
[ 1082.003995] [<ffffffff814855bd>] down_write+0x43/0x57
[ 1082.005000] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1082.007403] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1082.008988] [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs]
[ 1082.010193] [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77
[ 1082.011280] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
[ 1082.012265] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
[ 1082.013021] [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff
[ 1082.013738] [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b
[ 1082.014778] [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea
[ 1082.015778] [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e
[ 1082.016806] [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71
[ 1082.017789] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
[ 1082.018706] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
This happens because we can recursively acquire the semaphore
fs_info->delayed_iput_sem when attempting to allocate space to satisfy
a file write request as shown in the first trace above - when committing
a transaction we acquire (down_read) the semaphore before running the
delayed iputs, and when running a delayed iput() we can end up calling
an inode's eviction handler, which in turn commits another transaction
and attempts to acquire (down_read) again the semaphore to run more
delayed iput operations.
This results in a deadlock because if a task acquires a semaphore multiple
times it should invoke down_read_nested() with a different lockdep
class for each level of recursion.
Fix this by simplifying the implementation and using a mutex instead that
is acquired by the cleaner kthread before it runs the delayed iputs
instead of always acquiring a semaphore before delayed references are
run from anywhere.
Fixes: d7c151717a1e (btrfs: Fix NO_SPACE bug caused by delayed-iput)
Cc: stable@vger.kernel.org # 4.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
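The hang seen in the traces above can be illustrated with a toy, single-threaded model. It assumes a writer-fair read/write semaphore, that is, one where new readers queue behind a waiting writer; `struct toy_rwsem` and its helpers are hypothetical and are not the kernel's rwsem API.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a writer-fair rw semaphore; not the kernel implementation. */
struct toy_rwsem {
	int readers;		/* read holds currently granted */
	bool writer_waiting;	/* a writer is queued behind the readers */
};

/* A new reader is admitted only if no writer is already queued. */
static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
	if (sem->writer_waiting)
		return false;
	sem->readers++;
	return true;
}

static void toy_up_read(struct toy_rwsem *sem)
{
	assert(sem->readers > 0);
	sem->readers--;
}

/*
 * Returns true if the write lock could be taken at once; otherwise the
 * writer queues, which blocks every later reader.
 */
static bool toy_down_write_or_queue(struct toy_rwsem *sem)
{
	if (sem->readers == 0)
		return true;
	sem->writer_waiting = true;
	return false;
}

/* Replays the scenario from the commit message; true means deadlock. */
static bool recursive_read_deadlocks(void)
{
	struct toy_rwsem sem = { 0, false };

	/* Task A: commit transaction -> run delayed iputs (read hold #1). */
	bool outer = toy_down_read_trylock(&sem);
	/* Task B: wants the semaphore for writing and has to queue. */
	bool writer = toy_down_write_or_queue(&sem);
	/* Task A again, via iput -> evict -> commit -> run delayed iputs. */
	bool inner = toy_down_read_trylock(&sem);

	/* A waits behind B, and B waits for A's first hold: nobody moves. */
	return outer && !writer && !inner;
}
```

Without a queued writer the nested read succeeds in this model, which matches the behaviour described here: the recursion only turns into a hang once another task is waiting in down_write().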
2016-01-15 14:05:12 +03:00
struct mutex cleaner_delayed_iput_mutex ;
2009-11-12 12:36:34 +03:00
2012-05-16 19:55:38 +04:00
/* this protects tree_mod_seq_list */
spinlock_t tree_mod_seq_lock ;
2013-04-24 20:57:33 +04:00
atomic64_t tree_mod_seq ;
2012-05-16 19:55:38 +04:00
struct list_head tree_mod_seq_list ;
/* this protects tree_mod_log */
rwlock_t tree_mod_log_lock ;
struct rb_root tree_mod_log ;
2008-11-07 06:02:51 +03:00
atomic_t async_delalloc_pages ;
2011-04-12 01:25:13 +04:00
atomic_t open_ioctl_trans ;
2008-04-10 00:28:12 +04:00
2008-07-24 19:57:52 +04:00
/*
2013-05-15 11:48:23 +04:00
* this is used to protect the following list - - ordered_roots .
2008-07-24 19:57:52 +04:00
*/
2013-05-15 11:48:23 +04:00
spinlock_t ordered_root_lock ;
2009-03-31 21:27:11 +04:00
/*
2013-05-15 11:48:23 +04:00
* all fs / file tree roots in which there are data = ordered extents
* pending writeback are added into this list .
*
2009-03-31 21:27:11 +04:00
* these can span multiple transactions and basically include
* every dirty data page that isn ' t from nodatacow
*/
2013-05-15 11:48:23 +04:00
struct list_head ordered_roots ;
2009-03-31 21:27:11 +04:00
2014-03-06 09:55:03 +04:00
struct mutex delalloc_root_mutex ;
2013-05-15 11:48:22 +04:00
spinlock_t delalloc_root_lock ;
/* all fs/file tree roots that have delalloc inodes. */
struct list_head delalloc_roots ;
2008-07-24 19:57:52 +04:00
2008-06-12 00:50:36 +04:00
/*
* there is a pool of worker threads for checksumming during writes
* and a pool for checksumming after reads . This is because readers
* can run with FS locks held , and the writers may be waiting for
* those locks . We don ' t want ordering in the pending list to cause
* deadlocks , and so the two are serviced separately .
2008-06-12 22:46:17 +04:00
*
* A third pool does submit_bio to avoid deadlocking with the other
* two
2008-06-12 00:50:36 +04:00
*/
2014-02-28 06:46:19 +04:00
struct btrfs_workqueue * workers ;
struct btrfs_workqueue * delalloc_workers ;
struct btrfs_workqueue * flush_workers ;
struct btrfs_workqueue * endio_workers ;
struct btrfs_workqueue * endio_meta_workers ;
struct btrfs_workqueue * endio_raid56_workers ;
2014-09-12 14:44:03 +04:00
struct btrfs_workqueue * endio_repair_workers ;
2014-02-28 06:46:19 +04:00
struct btrfs_workqueue * rmw_workers ;
struct btrfs_workqueue * endio_meta_write_workers ;
struct btrfs_workqueue * endio_write_workers ;
struct btrfs_workqueue * endio_freespace_worker ;
struct btrfs_workqueue * submit_workers ;
struct btrfs_workqueue * caching_workers ;
struct btrfs_workqueue * readahead_workers ;
2011-06-30 22:42:28 +04:00
2008-07-17 20:53:51 +04:00
/*
* fixup workers take dirty pages that didn ' t properly go through
* the cow mechanism and make them safe to write . It happens
* for the sys_munmap function call path
*/
2014-02-28 06:46:19 +04:00
struct btrfs_workqueue * fixup_workers ;
struct btrfs_workqueue * delayed_workers ;
2014-05-23 03:18:52 +04:00
/* the extent workers do delayed refs on the extent allocation tree */
struct btrfs_workqueue * extent_workers ;
2008-06-26 00:01:31 +04:00
struct task_struct * transaction_kthread ;
struct task_struct * cleaner_kthread ;
2008-06-12 05:47:56 +04:00
int thread_pool_size ;
2008-06-12 00:50:36 +04:00
2013-11-01 21:07:04 +04:00
struct kobject * space_info_kobj ;
2007-03-20 21:38:32 +03:00
2007-11-16 22:57:08 +03:00
u64 total_pinned ;
2009-03-13 18:00:37 +03:00
2013-01-29 14:09:20 +04:00
/* used to keep from writing metadata until there is a nice batch */
struct percpu_counter dirty_metadata_bytes ;
2013-01-29 14:10:51 +04:00
struct percpu_counter delalloc_bytes ;
2013-01-29 14:09:20 +04:00
s32 dirty_metadata_batch ;
2013-01-29 14:10:51 +04:00
s32 delalloc_batch ;
2008-03-24 22:01:56 +03:00
struct list_head dirty_cowonly_roots ;
2008-03-24 22:02:07 +03:00
struct btrfs_fs_devices * fs_devices ;
2009-03-10 19:39:20 +03:00
/*
* the space_info list is almost entirely read only . It only changes
* when we add a new raid type to the FS , and that happens
* very rarely . RCU is used to protect it .
*/
2008-03-24 22:01:59 +03:00
struct list_head space_info ;
2009-03-10 19:39:20 +03:00
2012-07-10 06:21:07 +04:00
struct btrfs_space_info * data_sinfo ;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
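The two COW cases described in this commit message (non-shared block versus block with reference count > 1) can be sketched as follows. The structures and `cow_block()` are hypothetical and track only reference counts; back reference items and the on-disk format are not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PTRS 4	/* arbitrary fan-out for this sketch */

/* Hypothetical lower-level extent with a reference count. */
struct extent_sketch {
	int refs;
};

/* Hypothetical tree block pointing at lower-level extents. */
struct tree_block_sketch {
	int refs;
	struct extent_sketch *ptrs[MAX_PTRS];
	size_t nr_ptrs;
	int freed;
};

/* COW `old` into `new_blk`, following the two cases from the commit message. */
static void cow_block(struct tree_block_sketch *old,
		      struct tree_block_sketch *new_blk)
{
	size_t i;

	assert(old->refs >= 1 && !old->freed);
	assert(old->nr_ptrs <= MAX_PTRS);

	new_blk->refs = 1;
	new_blk->freed = 0;
	new_blk->nr_ptrs = old->nr_ptrs;
	for (i = 0; i < old->nr_ptrs; i++)
		new_blk->ptrs[i] = old->ptrs[i];

	if (old->refs == 1) {
		/*
		 * Not shared: free the old block at once. The new block
		 * inherits its references, so lower-level counts stay put.
		 */
		old->refs = 0;
		old->freed = 1;
	} else {
		/*
		 * Shared: the new block adds one reference to every extent
		 * it points to, and the old block loses one reference.
		 */
		for (i = 0; i < new_blk->nr_ptrs; i++)
			new_blk->ptrs[i]->refs++;
		old->refs--;
	}
}
```

In the non-shared case no lower-level reference count is touched at all, which is where the saving over the dead-root walk comes from.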
2009-06-10 18:45:14 +04:00
struct reloc_control * reloc_ctl ;
btrfs: Do not use data_alloc_cluster in ssd mode
This patch provides a band aid to improve the 'out of the box'
behaviour of btrfs for disks that are detected as being an ssd. In a
general purpose mixed workload scenario, the current ssd mode causes
overallocation of available raw disk space for data, while leaving
behind increasing amounts of unused fragmented free space. This
situation leads to early ENOSPC problems which are harming user
experience and adoption of btrfs as a general purpose filesystem.
This patch modifies the data extent allocation behaviour of the ssd mode
to make it behave identical to nossd mode. The metadata behaviour and
additional ssd_spread option stay untouched so far.
Recommendations for future development are to reconsider the current
oversimplified nossd / ssd distinction and the broken detection
mechanism based on the rotational attribute in sysfs and provide
experienced users with a more flexible way to choose allocator behaviour
for data and metadata, optimized for certain use cases, while keeping
sane 'out of the box' default settings. The internals of the current
btrfs code have more potential than what currently gets exposed to the
user to choose from.
The SSD story...
In the first year of btrfs development, around early 2008, btrfs
gained a mount option which enables specific functionality for
filesystems on solid state devices. The first occurrence of this
functionality is in commit e18e4809, labeled "Add mount -o ssd, which
includes optimizations for seek free storage".
The effect on allocating free space for doing (data) writes is to
'cluster' writes together, writing them out in contiguous space, as
opposed to a 'tetris' way of putting all separate writes into any free
space fragment that fits (which is what the -o nossd behaviour does).
A somewhat simplified explanation of what happens is that, when for
example, the 'cluster' size is set to 2MiB, when we do some writes, the
data allocator will search for a free space block that is 2MiB big, and
put the writes in there. The ssd mode itself might allow a 2MiB cluster
to be composed of multiple free space extents with some existing data in
between, while the additional ssd_spread mount option kills off this
option and requires fully free space.
The idea behind this is (commit 536ac8ae): "The [...] clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
block. So, effectively this means applying a "locality based algorithm"
and trying to outsmart the actual ssd.
Since then, various changes have been made to the involved code, but the
basic idea is still present, and gets activated whenever the ssd mount
option is active. This also happens by default, when the rotational flag
as seen at /sys/block/<device>/queue/rotational is set to 0.
However, there's a number of problems with this approach.
First, what the optimization is trying to do is outsmart the ssd by
assuming there is a relation between the physical address space of the
block device as seen by btrfs and the actual physical storage of the
ssd, and then adjusting data placement. However, since the introduction
of the Flash Translation Layer (FTL) which is a part of the internal
controller of an ssd, these attempts are futile. The use of good quality
FTL in consumer ssd products might have been limited in 2008, but this
situation has changed drastically soon after that time. Today, even the
flash memory in your automatic cat feeding machine or your grandma's
wheelchair has a full featured one.
Second, the behaviour as described above results in the filesystem being
filled up with badly fragmented free space extents because of relatively
small pieces of space that are freed up by deletes, but not selected
again as part of a 'cluster'. Since the algorithm prefers allocating a
new chunk over going back to tetris mode, the end result is a filesystem
in which all raw space is allocated, but which is composed of
underutilized chunks with a 'shotgun blast' pattern of fragmented free
space. Usually, the next problematic thing that happens is the
filesystem wanting to allocate new space for metadata, which causes the
filesystem to fail in spectacular ways.
Third, the default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with spreading out writes over
the full address space and ignoring freed up space leads to worst case
behaviour in providing information to the ssd itself, since it will
never learn that all the free space left behind is actually free. There
are two ways to let an ssd know previously written data does not have to
be preserved, which are sending explicit signals using discard or
fstrim, or by simply overwriting the space with new data. The worst
case behaviour is the btrfs ssd_spread mount option in combination with
not having discard enabled. It has a side effect of minimizing the reuse
of free space previously written in.
Fourth, the rotational flag in /sys/ does not reliably indicate if the
device is a locally attached ssd. For example, iSCSI or NBD displays as
non-rotational, while a loop device on an ssd shows up as rotational.
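As an illustration of the flag discussed above, here is a minimal userspace sketch in C that reads and interprets /sys/block/<device>/queue/rotational. The helper names parse_rotational and read_rotational are made up for this example and are not part of btrfs; the kernel itself does not read the flag through sysfs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Interpret the contents of a queue/rotational file.
 * Returns 1 for rotational, 0 for non-rotational, -1 if unparsable.
 */
int parse_rotational(const char *text)
{
	assert(text != NULL);
	if (text[0] == '0' && (text[1] == '\n' || text[1] == '\0'))
		return 0;
	if (text[0] == '1' && (text[1] == '\n' || text[1] == '\0'))
		return 1;
	return -1;
}

/* Read /sys/block/<dev>/queue/rotational; returns -1 on any error. */
int read_rotational(const char *dev)
{
	char path[256];
	char buf[8] = "";
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "/sys/block/%s/queue/rotational", dev);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '\0';
	fclose(f);
	return parse_rotational(buf);
}
```

As the text points out, a result of 0 only says what the block layer reports; it does not prove that the device is a locally attached ssd.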
The combination of the second and third problem effectively means that
despite all the good intentions, the btrfs ssd mode reliably causes the
ssd hardware and the filesystem structures and performance to be choked
to death. The clickbait version of the title of this story would have
been "Btrfs ssd optimizations considered harmful for ssds".
The current nossd 'tetris' mode (even still without discard) allows a
pattern of overwriting much more previously used space, causing many
more implicit discards to happen because of the overwrite information
the ssd gets. The actual location in the physical address space, as seen
from the point of view of btrfs is irrelevant, because the actual writes
to the low level flash are reordered anyway thanks to the FTL.
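The effect of 'tetris' mode can be pictured with a simple first-fit allocator that always reuses the lowest free hole that is large enough. This is only a sketch under that assumption, not the btrfs allocator; struct free_extent, NO_SPACE and first_fit_alloc are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One free range in the address space seen by the filesystem. */
struct free_extent {
	uint64_t start;
	uint64_t len;
};

#define NO_SPACE UINT64_MAX

/*
 * First fit: take space from the lowest-addressed free extent that is
 * large enough, so small holes left by deletes are overwritten again.
 * The extents are assumed to be sorted by start address.
 */
uint64_t first_fit_alloc(struct free_extent *ext, size_t n, uint64_t size)
{
	assert(size > 0);
	for (size_t i = 0; i < n; i++) {
		if (ext[i].len >= size) {
			uint64_t start = ext[i].start;

			ext[i].start += size;
			ext[i].len -= size;
			return start;
		}
	}
	return NO_SPACE;
}
```

Every allocation that lands in a previously written hole tells the ssd, by overwrite, that the old data is dead, which is the implicit discard the text describes.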
Changes made in the code
1. Make ssd mode data allocation identical to tetris mode, like nossd.
2. Adjust and clean up filesystem mount messages so that we can easily
identify if a kernel has this patch applied or not, when providing
support to end users. Also, make better use of the *_and_info helpers to
only trigger messages on actual state changes.
Backporting notes
Notes for whoever wants to backport this patch to their 4.9 LTS kernel:
* First apply commit 951e7966 "btrfs: drop the nossd flag when
remounting with -o ssd", or fix up the differences manually.
* The rest of the conflicts are because of the fs_info refactoring. So,
for example, instead of using fs_info, it's root->fs_info in
extent-tree.c
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-28 09:31:28 +03:00
/* data_alloc_cluster is only used in ssd_spread mode */
2009-04-03 17:47:43 +04:00
struct btrfs_free_cluster data_alloc_cluster ;
/* all metadata allocations go through this cluster */
struct btrfs_free_cluster meta_alloc_cluster ;
2008-04-04 23:40:00 +04:00
2011-05-24 23:35:30 +04:00
/* auto defrag inodes go here */
spinlock_t defrag_inodes_lock ;
struct rb_root defrag_inodes ;
atomic_t defrag_running ;
2013-01-29 14:13:12 +04:00
/* Used to protect avail_{data, metadata, system}_alloc_bits */
seqlock_t profiles_lock ;
2012-01-17 00:04:47 +04:00
/*
* these three are in extended format ( availability of single
* chunks is denoted by BTRFS_AVAIL_ALLOC_BIT_SINGLE bit , other
* types are denoted by corresponding BTRFS_BLOCK_GROUP_ * bits )
*/
2008-04-04 23:40:00 +04:00
u64 avail_data_alloc_bits ;
u64 avail_metadata_alloc_bits ;
u64 avail_system_alloc_bits ;
2008-04-28 23:29:42 +04:00
2012-01-17 00:04:47 +04:00
/* restriper state */
spinlock_t balance_lock ;
struct mutex balance_mutex ;
2012-01-17 00:04:49 +04:00
atomic_t balance_running ;
atomic_t balance_pause_req ;
2012-01-17 00:04:49 +04:00
atomic_t balance_cancel_req ;
2012-01-17 00:04:47 +04:00
struct btrfs_balance_control * balance_ctl ;
2012-01-17 00:04:49 +04:00
wait_queue_head_t balance_wait_q ;
2012-01-17 00:04:47 +04:00
2009-04-22 01:40:57 +04:00
unsigned data_chunk_allocations ;
unsigned metadata_ratio ;
2008-04-28 23:29:42 +04:00
void * bdev_holder ;
2011-01-06 14:30:25 +03:00
2011-03-08 16:14:00 +03:00
/* private scrub information */
struct mutex scrub_lock ;
atomic_t scrubs_running ;
atomic_t scrub_pause_req ;
atomic_t scrubs_paused ;
atomic_t scrub_cancel_req ;
wait_queue_head_t scrub_pause_wait ;
int scrub_workers_refcnt ;
2014-02-28 06:46:19 +04:00
struct btrfs_workqueue * scrub_workers ;
struct btrfs_workqueue * scrub_wr_completion_workers ;
struct btrfs_workqueue * scrub_nocow_workers ;
2015-06-04 15:09:15 +03:00
struct btrfs_workqueue * scrub_parity_workers ;
2011-03-08 16:14:00 +03:00
2011-11-09 16:44:05 +04:00
# ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
u32 check_integrity_print_mask ;
# endif
2011-09-13 14:56:09 +04:00
/* is qgroup tracking in a consistent state? */
u64 qgroup_flags ;
/* holds configuration and tracking. Protected by qgroup_lock */
struct rb_root qgroup_tree ;
2014-05-14 04:30:47 +04:00
struct rb_root qgroup_op_tree ;
2011-09-13 14:56:09 +04:00
spinlock_t qgroup_lock ;
2014-05-14 04:30:47 +04:00
spinlock_t qgroup_op_lock ;
atomic_t qgroup_op_seq ;
2011-09-13 14:56:09 +04:00
2013-05-06 15:03:27 +04:00
/*
* used to avoid frequently calling ulist_alloc ( ) / ulist_free ( )
* when doing qgroup accounting , it must be protected by qgroup_lock .
*/
struct ulist * qgroup_ulist ;
2013-04-07 14:50:16 +04:00
/* protect user change for quota operations */
struct mutex qgroup_ioctl_lock ;
2011-09-13 14:56:09 +04:00
/* list of dirty qgroups to be written at next commit */
struct list_head dirty_qgroups ;
2015-04-17 05:23:16 +03:00
/* used by qgroup for an efficient tree traversal */
2011-09-13 14:56:09 +04:00
u64 qgroup_seq ;
2011-11-09 16:44:05 +04:00
2013-04-25 20:04:51 +04:00
/* qgroup rescan items */
struct mutex qgroup_rescan_lock ; /* protects the progress item */
struct btrfs_key qgroup_rescan_progress ;
2014-02-28 06:46:19 +04:00
struct btrfs_workqueue * qgroup_rescan_workers ;
2013-05-06 23:14:17 +04:00
struct completion qgroup_rescan_completion ;
Btrfs: fix qgroup rescan resume on mount
When called during mount, we cannot start the rescan worker thread until
open_ctree is done. This commit restructures the qgroup rescan internals to
enable a clean deferral of the rescan resume operation.
First of all, the struct qgroup_rescan is removed, saving us a malloc and
some initialization synchronization problems. Its only element (the worker
struct) now lives within fs_info just as the rest of the rescan code.
Then setting up a rescan worker is split into several reusable stages.
Currently we have three different rescan startup scenarios:
(A) rescan ioctl
(B) rescan resume by mount
(C) rescan by quota enable
Each case needs its own combination of the four following steps:
(1) set the progress [A, C: zero; B: state of umount]
(2) commit the transaction [A]
(3) set the counters [A, C: zero; B: state of umount]
(4) start worker [A, B, C]
qgroup_rescan_init does step (1). There's no extra function added to commit
a transaction, we've got that already. qgroup_rescan_zero_tracking does
step (3). Step (4) is nothing more than a call to the generic
btrfs_queue_worker.
We also get rid of a double check for the rescan progress during
btrfs_qgroup_account_ref, which is no longer required due to having step 2
from the list above.
As a side effect, this commit prepares to move the rescan start code from
btrfs_run_qgroups (which is run during commit) to a less time critical
section.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
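The scenario/step matrix from the message above can be written down as a small table-driven sketch. The enum, the STEP_* masks and rescan_steps() are hypothetical names chosen for this illustration; the real code calls qgroup_rescan_init and friends directly.

```c
#include <assert.h>

/* The three rescan startup scenarios (A), (B), (C) from the message. */
enum rescan_scenario {
	RESCAN_IOCTL,		/* (A) rescan ioctl */
	RESCAN_RESUME_MOUNT,	/* (B) rescan resume by mount */
	RESCAN_QUOTA_ENABLE	/* (C) rescan by quota enable */
};

#define STEP_SET_PROGRESS	(1u << 0)	/* step (1) */
#define STEP_COMMIT_TRANS	(1u << 1)	/* step (2) */
#define STEP_SET_COUNTERS	(1u << 2)	/* step (3) */
#define STEP_START_WORKER	(1u << 3)	/* step (4) */

/* Return the set of steps a scenario needs, as a bitmask. */
unsigned int rescan_steps(enum rescan_scenario s)
{
	/* Steps (1), (3) and (4) apply to every scenario. */
	unsigned int steps = STEP_SET_PROGRESS | STEP_SET_COUNTERS |
			     STEP_START_WORKER;

	assert(s == RESCAN_IOCTL || s == RESCAN_RESUME_MOUNT ||
	       s == RESCAN_QUOTA_ENABLE);
	/* Only the ioctl path commits the transaction, step (2). */
	if (s == RESCAN_IOCTL)
		steps |= STEP_COMMIT_TRANS;
	return steps;
}
```

The values used in steps (1) and (3) still differ per scenario (zero for A and C, the state at umount for B); the mask only records which steps run.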
2013-05-28 19:47:24 +04:00
struct btrfs_work qgroup_rescan_work ;
2016-08-15 19:10:33 +03:00
bool qgroup_rescan_running ; /* protected by qgroup_rescan_lock */
2013-04-25 20:04:51 +04:00
2011-01-06 14:30:25 +03:00
/* filesystem state */
2013-01-29 14:14:48 +04:00
unsigned long fs_state ;
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- Introduce a delayed root object into the filesystem that uses two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items, and the
other is used to manage the delayed nodes which are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees: one is used to manage the directory name
index items which are going to be inserted into the b+ tree, and the other is
used to manage the directory name index items which are going to be deleted
from the b+ tree.
- Introduce a worker to deal with the delayed operations. This worker handles
the delayed directory name index item insertions and deletions and the
delayed inode updates.
When the number of delayed items exceeds the lower limit, we create works for
some delayed nodes, insert them into the work queue of the worker, and then
return.
When the number of delayed items exceeds the upper bound, we create works for
all the delayed nodes that haven't been dealt with, insert them into the work
queue of the worker, and then wait until the number of untreated items drops
below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add
the information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search for
it in the inserting rb-tree first. If we find it, we just drop it. If not,
we add its key into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit a transaction, we deal with all the delayed nodes.
- The delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
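The insert and delete paths described in the message above can be sketched as follows. This is a simplified illustration that uses small arrays in place of the two rb-trees; struct delayed_node_sketch, MAX_PENDING, delayed_insert and delayed_delete are hypothetical names, and the item-count balancing is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16

/* Stand-in for a delayed node: pending insertions and deletions. */
struct delayed_node_sketch {
	uint64_t ins[MAX_PENDING];	/* indexes waiting to be inserted */
	size_t nr_ins;
	uint64_t del[MAX_PENDING];	/* indexes waiting to be deleted */
	size_t nr_del;
};

/* Queue a directory name index for later insertion into the b+ tree. */
bool delayed_insert(struct delayed_node_sketch *n, uint64_t index)
{
	assert(n != NULL);
	if (n->nr_ins == MAX_PENDING)
		return false;
	n->ins[n->nr_ins++] = index;
	return true;
}

/*
 * Delete a directory name index. If the insertion is still pending,
 * both operations cancel out and nothing reaches the b+ tree;
 * otherwise the deletion itself is queued.
 */
bool delayed_delete(struct delayed_node_sketch *n, uint64_t index)
{
	assert(n != NULL);
	for (size_t i = 0; i < n->nr_ins; i++) {
		if (n->ins[i] == index) {
			n->nr_ins--;
			n->ins[i] = n->ins[n->nr_ins];
			return true;
		}
	}
	if (n->nr_del == MAX_PENDING)
		return false;
	n->del[n->nr_del++] = index;
	return true;
}
```

The cancellation in delayed_delete is where short-lived files save b+ tree work: a create followed by a delete never touches the tree at all.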
2011-04-22 14:12:22 +04:00
struct btrfs_delayed_root * delayed_root ;
2011-11-03 23:17:42 +04:00
2011-05-23 16:30:00 +04:00
/* readahead tree */
spinlock_t reada_lock ;
struct radix_tree_root reada_tree ;
2011-11-06 12:05:08 +04:00
2016-01-07 13:38:48 +03:00
/* readahead works cnt */
atomic_t reada_works_cnt ;
2013-12-16 22:24:27 +04:00
/* Extent buffer radix tree */
spinlock_t buffer_lock ;
struct radix_tree_root buffer_radix ;
2011-11-03 23:17:42 +04:00
/* next backup root to be overwritten */
int backup_root_index ;
2012-08-01 20:56:49 +04:00
2012-11-05 20:26:40 +04:00
/* device replace state */
struct btrfs_dev_replace dev_replace ;
2012-11-05 20:54:08 +04:00
Btrfs: fix use-after-free in the finishing procedure of the device replace
During device replace test, we hit a null pointer dereference (It was very easy
to reproduce it by running xfstests' btrfs/011 on the devices with the virtio
scsi driver). There were two bugs that caused this problem:
- We might allocate new chunks on the replaced device after we updated
the mapping tree. And we forgot to replace the source device in those
mappings of the new chunks.
- We might get the mapping information which includes the source device
before the mapping information update. And then submit the bio which was
based on that mapping information after we freed the source device.
For the first bug, we can fix it by doing the mapping tree update and source
device removal in the same context of the chunk mutex. The chunk mutex is
used to protect the allocable device list, so the above method prevents
new chunk allocation, and after we remove the source device, all
the new chunks will be allocated on the new device. So it can fix
the first bug.
For the second bug, we need to make sure all in-flight bios are finished and
no new bios are produced while we are removing the source device. To fix
this problem, we introduced a global @bio_counter; we not only inc/dec
@bio_counter outside of map_blocks, but also inc it before submitting a bio
and dec @bio_counter when ending bios.
Since Raid56 is a little different and device replace doesn't support raid56
yet, it is not addressed in the patch and I add comments to make sure we will
fix it in the future.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
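The idea behind @bio_counter can be sketched in userspace with C11 atomics. This is an assumption-laden illustration, not the kernel code: the kernel uses a percpu_counter and a wait queue, while the sketch below is non-blocking and all names in it (struct replace_state, bio_counter_try_inc and so on) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the replace-related state kept in fs_info. */
struct replace_state {
	atomic_long bio_counter;	/* bios in flight */
	atomic_bool blocked;		/* set while the source device is removed */
};

void replace_state_init(struct replace_state *rs)
{
	atomic_init(&rs->bio_counter, 0);
	atomic_init(&rs->blocked, false);
}

/* Called before the source device is removed. */
void replace_block_new_bios(struct replace_state *rs)
{
	atomic_store(&rs->blocked, true);
}

/*
 * Account a bio before mapping and submitting it. The counter is raised
 * first and the flag checked afterwards, so the remover can never see
 * a zero counter while this bio still proceeds. Returns false if new
 * bios are currently blocked (the kernel would sleep here instead).
 */
bool bio_counter_try_inc(struct replace_state *rs)
{
	atomic_fetch_add(&rs->bio_counter, 1);
	if (atomic_load(&rs->blocked)) {
		atomic_fetch_sub(&rs->bio_counter, 1);
		return false;
	}
	return true;
}

/* Called when a bio ends. */
void bio_counter_dec(struct replace_state *rs)
{
	long old = atomic_fetch_sub(&rs->bio_counter, 1);

	assert(old > 0);
	(void)old;
}

/* The source device may be freed once no bio can reference it. */
bool source_device_quiesced(struct replace_state *rs)
{
	return atomic_load(&rs->blocked) &&
	       atomic_load(&rs->bio_counter) == 0;
}
```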
2014-01-30 12:46:55 +04:00
struct percpu_counter bio_counter ;
wait_queue_head_t replace_wait ;
2013-08-15 19:11:21 +04:00
struct semaphore uuid_tree_rescan_sem ;
Btrfs: reclaim the reserved metadata space at background
Before applying this patch, the task had to reclaim the metadata space
by itself if the metadata space was not enough. And when the task started
the space reclamation, all the other tasks which wanted to reserve the
metadata space were blocked. In some cases, they would be blocked for
a long time, which made the performance fluctuate wildly.
So we introduce the background metadata space reclamation: when the space
is about to be exhausted, we insert a reclaim work into the workqueue, and the
worker of the workqueue helps us to reclaim the reserved space in the
background. This way, the tasks needn't reclaim the space by themselves in
most cases, and even if the tasks have to reclaim the space or are blocked
for the space reclamation, they will get enough space more quickly.
Here is my test result (tested by compilebench):
Memory: 2GB
CPU: 2Cores * 1CPU
Partition: 40GB(SSD)
Test command:
# compilebench -D <mnt> -m
Without this patch:
intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
With this patch:
intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-05-14 04:29:04 +04:00
/* Used to reclaim the metadata space in the background. */
struct work_struct async_reclaim_work ;
2014-09-18 19:20:02 +04:00
spinlock_t unused_bgs_lock ;
struct list_head unused_bgs ;
2015-01-29 22:18:25 +03:00
struct mutex unused_bg_unpin_mutex ;
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this are:
CPU 0 CPU 1
btrfs_balance()
btrfs_relocate_chunk()
btrfs_relocate_block_group(bg X)
btrfs_lookup_block_group(bg X)
cleaner_kthread
locks fs_info->cleaner_mutex
btrfs_delete_unused_bgs()
finds bg X, which became
unused in the previous
transaction
checks bg X ->ro == 0,
so it proceeds
sets bg X ->ro to 1
(btrfs_set_block_group_ro(bg X))
blocks on fs_info->cleaner_mutex
btrfs_remove_chunk(bg X)
unlocks fs_info->cleaner_mutex
acquires fs_info->cleaner_mutex
relocate_block_group()
--> does nothing, no extents found in
the extent tree from bg X
unlocks fs_info->cleaner_mutex
btrfs_relocate_block_group(bg X) returns
btrfs_remove_chunk(bg X)
extent map not found
--> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
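The fix can be illustrated with a userspace sketch in which both paths take the same mutex around "look up, then remove". It assumes POSIX threads; struct block_group_sketch and the two functions are hypothetical stand-ins for the cleaner and balance paths, and only the mutex name mirrors the field added to fs_info.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a block group and its chunk mapping. */
struct block_group_sketch {
	bool removed;	/* chunk (extent map) already removed */
};

static pthread_mutex_t delete_unused_bgs_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Cleaner path: delete a block group that became unused. */
int delete_unused_block_group(struct block_group_sketch *bg)
{
	int ret = 0;

	assert(bg != NULL);
	pthread_mutex_lock(&delete_unused_bgs_mutex);
	if (bg->removed)
		ret = -ENOENT;
	else
		bg->removed = true;
	pthread_mutex_unlock(&delete_unused_bgs_mutex);
	return ret;
}

/*
 * Balance path: relocate the block group, then remove its chunk.
 * The existence check and the removal happen under the same mutex,
 * so the cleaner cannot remove the chunk in between.
 */
int relocate_block_group_sketch(struct block_group_sketch *bg)
{
	int ret = 0;

	assert(bg != NULL);
	pthread_mutex_lock(&delete_unused_bgs_mutex);
	if (bg->removed) {
		ret = -ENOENT;	/* already gone: report it, do not assert */
	} else {
		/* ... relocate extents here ... */
		bg->removed = true;
	}
	pthread_mutex_unlock(&delete_unused_bgs_mutex);
	return ret;
}
```

With the lock held across both steps, the losing side of the race sees a clean "not found" result instead of reaching the ASSERT(0) shown in the trace.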
2015-06-11 02:58:53 +03:00
struct mutex delete_unused_bgs_mutex ;
2014-09-23 09:40:08 +04:00
/* For btrfs to record security options */
struct security_mnt_opts security_opts ;
Btrfs: fix race between fs trimming and block group remove/allocation
Our fs trim operation, which is completely transactionless (doesn't start
or join an existing transaction), consists of visiting all block groups
and then, for each one, iterating its free space entries and performing a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.
If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, removing the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.
So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.
If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:
checking extents
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
owner ref check failed [833912832 16384]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
root 5 root dir 256 error
root 5 inode 260 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 262 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 263 errors 2001, no inode item, link count wrong
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
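The rule "keep the extent map until the discard completes" can be sketched as a small state machine. All names below are hypothetical and the sketch is single-threaded; in the kernel the deferred chunks sit on the pinned_chunks list under chunk_mutex.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a chunk whose physical range may be under discard. */
struct chunk_sketch {
	int trimming;		/* discards currently in flight */
	bool removed;		/* block group removal was requested */
	bool map_freed;		/* extent map dropped: range is reusable */
};

/* A discard against this chunk's free space starts. */
void trim_begin(struct chunk_sketch *c)
{
	assert(c != NULL);
	c->trimming++;
}

/* Block group removal: free the map now only if nothing is trimming. */
void remove_block_group_sketch(struct chunk_sketch *c)
{
	assert(c != NULL);
	c->removed = true;
	if (c->trimming == 0)
		c->map_freed = true;
	/* otherwise the chunk stays pinned until trim_end() */
}

/* A discard finished: release the pinned chunk if removal is pending. */
void trim_end(struct chunk_sketch *c)
{
	assert(c != NULL && c->trimming > 0);
	c->trimming--;
	if (c->removed && c->trimming == 0)
		c->map_freed = true;
}
```

As long as map_freed is false, no new block group can be mapped onto the same physical addresses, so a late discard cannot zero fresh metadata.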
2014-11-28 00:14:15 +03:00
/*
 * Chunks that can't be freed yet (under a trim/discard operation)
 * and will be freed later. Protected by fs_info->chunk_mutex.
*/
struct list_head pinned_chunks ;
2015-12-30 18:52:35 +03:00
2016-06-15 16:22:56 +03:00
/* Cached block sizes */
u32 nodesize ;
u32 sectorsize ;
u32 stripesize ;
2017-09-29 22:43:50 +03:00
# ifdef CONFIG_BTRFS_FS_REF_VERIFY
spinlock_t ref_verify_lock ;
struct rb_root block_tree ;
# endif
2007-11-16 22:57:08 +03:00
} ;
2008-03-24 22:01:56 +03:00
2016-06-15 16:22:56 +03:00
static inline struct btrfs_fs_info * btrfs_sb ( struct super_block * sb )
{
return sb - > s_fs_info ;
}
2014-03-06 09:38:19 +04:00
struct btrfs_subvolume_writers {
struct percpu_counter counter ;
wait_queue_head_t wait ;
} ;
2014-04-02 15:51:05 +04:00
/*
* The state of btrfs root
*/
/*
* btrfs_record_root_in_trans is a multi - step process ,
* and it can race with the balancing code . But the
* race is very small , and only the first time the root
* is added to each transaction . So IN_TRANS_SETUP
* is used to tell us when more checks are required
*/
# define BTRFS_ROOT_IN_TRANS_SETUP 0
# define BTRFS_ROOT_REF_COWS 1
# define BTRFS_ROOT_TRACK_DIRTY 2
# define BTRFS_ROOT_IN_RADIX 3
2016-06-21 16:52:41 +03:00
# define BTRFS_ROOT_ORPHAN_ITEM_INSERTED 4
# define BTRFS_ROOT_DEFRAG_RUNNING 5
# define BTRFS_ROOT_FORCE_COW 6
# define BTRFS_ROOT_MULTI_LOG_TASKS 7
# define BTRFS_ROOT_DIRTY 8
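Note that the BTRFS_ROOT_* values above are bit numbers, not masks; they index into the unsigned long state field of struct btrfs_root. The userspace sketch below shows the difference with non-atomic stand-ins. The sketch_* helpers are hypothetical; the kernel uses its atomic set_bit(), clear_bit() and test_bit() helpers for this.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit numbers copied from the definitions above. */
#define BTRFS_ROOT_REF_COWS	1
#define BTRFS_ROOT_FORCE_COW	6
#define BTRFS_ROOT_DIRTY	8

/* Non-atomic stand-in for set_bit(). */
void sketch_set_bit(unsigned int nr, unsigned long *state)
{
	assert(nr < sizeof(*state) * 8);
	*state |= 1UL << nr;
}

/* Non-atomic stand-in for clear_bit(). */
void sketch_clear_bit(unsigned int nr, unsigned long *state)
{
	assert(nr < sizeof(*state) * 8);
	*state &= ~(1UL << nr);
}

/* Non-atomic stand-in for test_bit(). */
bool sketch_test_bit(unsigned int nr, const unsigned long *state)
{
	assert(nr < sizeof(*state) * 8);
	return (*state >> nr) & 1UL;
}
```

Setting BTRFS_ROOT_DIRTY therefore yields the mask 0x100 in state, not the value 8.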
2014-04-02 15:51:05 +04:00
2007-03-20 21:38:32 +03:00
/*
* in ram representation of the tree . extent_root is used for all allocations
2007-04-25 23:52:25 +04:00
* and for the extent tree extent_root root .
2007-03-20 21:38:32 +03:00
*/
struct btrfs_root {
2007-10-16 00:14:19 +04:00
struct extent_buffer * node ;
2008-06-26 00:01:30 +04:00
2007-10-16 00:14:19 +04:00
struct extent_buffer * commit_root ;
2008-09-06 00:13:11 +04:00
struct btrfs_root * log_root ;
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keep the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a separate inode to hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
the same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps:
COW the block through the subvol's reloc tree, then update the block pointer in
the subvol to point to the new block. Since all reloc trees share
the same root key objectid, doing special handling for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
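The reuse of an already COWed block across reloc trees can be sketched as a lookup table from old to new block addresses. This is only an illustration of the sharing idea; struct reloc_cache, the sizes and reloc_cow_block() are hypothetical, and real relocation tracks far more state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RELOCATED		32
#define SKETCH_BLOCK_SIZE	16384u

/* Remembers where each relocated tree block ended up. */
struct reloc_cache {
	uint64_t old_bytenr[MAX_RELOCATED];
	uint64_t new_bytenr[MAX_RELOCATED];
	size_t nr;
	uint64_t next_free;	/* next free address in the target area */
};

/*
 * COW a tree block through some reloc tree. If another reloc tree has
 * already COWed the same block, the existing copy is returned, so the
 * relocated block stays shared between subvols.
 */
uint64_t reloc_cow_block(struct reloc_cache *c, uint64_t old_bytenr)
{
	uint64_t new_bytenr;

	assert(c != NULL);
	for (size_t i = 0; i < c->nr; i++) {
		if (c->old_bytenr[i] == old_bytenr)
			return c->new_bytenr[i];
	}
	assert(c->nr < MAX_RELOCATED);
	new_bytenr = c->next_free;
	c->next_free += SKETCH_BLOCK_SIZE;
	c->old_bytenr[c->nr] = old_bytenr;
	c->new_bytenr[c->nr] = new_bytenr;
	c->nr++;
	return new_bytenr;
}
```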
2008-09-26 18:09:34 +04:00
struct btrfs_root * reloc_root ;
2008-07-28 23:32:19 +04:00
2014-04-02 15:51:05 +04:00
unsigned long state ;
2007-03-15 19:56:47 +03:00
struct btrfs_root_item root_item ;
struct btrfs_key root_key ;
2007-03-20 21:38:32 +03:00
struct btrfs_fs_info * fs_info ;
2008-09-12 00:17:57 +04:00
struct extent_io_tree dirty_log_pages ;
2008-06-26 00:01:30 +04:00
struct mutex objectid_mutex ;
2009-01-21 20:54:03 +03:00
2010-05-16 18:46:25 +04:00
spinlock_t accounting_lock ;
struct btrfs_block_rsv * block_rsv ;
Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) as the inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, and we'll run out of inode numbers
as we keep creating/deleting files on 32-bit machines.
This fixes it, and it works similarly to how we cache free space in block
groups.
We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.
Because we are searching the commit root, we have to carefully handle the
cross-transaction case.
The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted at runtime.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
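The scan described above, which turns the inode numbers found in the tree into free ranges, can be sketched as a gap search over a sorted list. The names struct ino_range and find_free_ino_ranges are hypothetical, and the sketch stores plain extents only; the bitmap fallback is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A run of free inode numbers: [start, start + count). */
struct ino_range {
	uint64_t start;
	uint64_t count;
};

/*
 * Given the inode numbers in use (sorted ascending, all within
 * [first, last]), collect the free ranges in between.
 * Returns the number of ranges written to out (at most max_out).
 */
size_t find_free_ino_ranges(const uint64_t *used, size_t nr_used,
			    uint64_t first, uint64_t last,
			    struct ino_range *out, size_t max_out)
{
	uint64_t next = first;	/* lowest number not yet accounted for */
	size_t n = 0;

	assert(first <= last && last < UINT64_MAX);
	for (size_t i = 0; i < nr_used && n < max_out; i++) {
		assert(used[i] >= next && used[i] <= last);
		if (used[i] > next) {
			out[n].start = next;
			out[n].count = used[i] - next;
			n++;
		}
		next = used[i] + 1;
	}
	if (next <= last && n < max_out) {
		out[n].start = next;
		out[n].count = last - next + 1;
		n++;
	}
	return n;
}
```

For example, with inodes 256, 257 and 260 in use and a window of 256 to 263, the free ranges are 258-259 and 261-263.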
2011-04-20 06:06:11 +04:00
/* free ino cache stuff */
struct btrfs_free_space_ctl * free_ino_ctl ;
2014-02-05 05:37:48 +04:00
enum btrfs_caching_type ino_cache_state ;
spinlock_t ino_cache_lock ;
wait_queue_head_t ino_cache_wait ;
2011-04-20 06:06:11 +04:00
struct btrfs_free_space_ctl * free_ino_pinned ;
2014-02-05 05:37:48 +04:00
u64 ino_cache_progress ;
struct inode * ino_cache_inode ;
2011-04-20 06:06:11 +04:00
2008-09-06 00:13:11 +04:00
struct mutex log_mutex ;
2009-01-21 20:54:03 +03:00
wait_queue_head_t log_writer_wait ;
wait_queue_head_t log_commit_wait [ 2 ] ;
2014-02-20 14:08:58 +04:00
struct list_head log_ctxs [ 2 ] ;
2009-01-21 20:54:03 +03:00
atomic_t log_writers ;
atomic_t log_commit [ 2 ] ;
2012-09-06 14:04:27 +04:00
atomic_t log_batch ;
2014-02-20 14:08:56 +04:00
int log_transid ;
2014-02-20 14:08:59 +04:00
/* Updated whether or not the commit succeeds. */
int log_transid_committed ;
/* Only updated when the commit succeeds. */
2014-02-20 14:08:56 +04:00
int last_log_commit ;
2009-10-08 23:30:04 +04:00
pid_t log_start_pid ;
2008-08-05 07:17:27 +04:00
2007-04-09 18:42:37 +04:00
u64 objectid ;
u64 last_trans ;
2007-10-16 00:14:19 +04:00
2007-03-20 21:38:32 +03:00
u32 type ;
2009-09-21 23:56:00 +04:00
u64 highest_objectid ;
2011-06-14 04:00:16 +04:00
2016-07-15 16:23:37 +03:00
# ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
2014-10-08 00:24:20 +04:00
/* only used when CONFIG_BTRFS_FS_RUN_SANITY_TESTS is enabled */
2014-05-08 01:06:09 +04:00
u64 alloc_bytenr ;
2016-07-15 16:23:37 +03:00
# endif
2014-05-08 01:06:09 +04:00
2008-06-26 00:01:31 +04:00
u64 defrag_trans_start ;
2007-08-08 00:15:09 +04:00
struct btrfs_key defrag_progress ;
2008-05-24 22:04:53 +04:00
struct btrfs_key defrag_max ;
2007-08-29 23:47:34 +04:00
char * name ;
2008-03-24 22:01:56 +03:00
/* the dirty list is only used by non-reference counted roots */
struct list_head dirty_list ;
2008-07-24 20:17:14 +04:00
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in a subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, its level and the
tree in which the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 18:45:14 +04:00
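The reference-count rule from the commit message above (a non-shared block is freed at once and its copy inherits the references; a shared block bumps the counts of everything the copy points to and drops one reference on the old block) can be sketched as follows. This is a toy model under stated assumptions: `demo_extent`, `demo_block` and `demo_cow_block` are hypothetical names, and real btrfs tracks these counts in the extent tree rather than in in-memory fields.

```c
#include <assert.h>

#define DEMO_MAX_CHILDREN 4	/* hypothetical fan-out for the sketch */

struct demo_extent {
	int refs;
};

struct demo_block {
	int refs;
	int nr_children;
	struct demo_extent *children[DEMO_MAX_CHILDREN];
};

/* COW `old` into `copy`, applying the rule described in the message. */
static void demo_cow_block(struct demo_block *old, struct demo_block *copy)
{
	int i;

	copy->refs = 1;
	copy->nr_children = old->nr_children;
	for (i = 0; i < old->nr_children; i++)
		copy->children[i] = old->children[i];

	if (old->refs == 1) {
		/*
		 * Non-shared: the old block is freed immediately and the
		 * copy inherits its references, so child counts are untouched.
		 */
		old->refs = 0;
		old->nr_children = 0;
	} else {
		/*
		 * Shared: both blocks now point at the children, so each
		 * child gains a reference and the old block loses one.
		 */
		for (i = 0; i < copy->nr_children; i++)
			copy->children[i]->refs++;
		old->refs--;
	}
}
```

In the non-shared case no lower-level count changes at all, which is the saving the message attributes to avoiding dead root walks.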
struct list_head root_list ;
2012-10-12 23:27:49 +04:00
spinlock_t log_extents_lock [ 2 ] ;
struct list_head logged_list [ 2 ] ;
2010-05-16 18:49:58 +04:00
spinlock_t orphan_lock ;
2012-05-23 22:26:42 +04:00
atomic_t orphan_inodes ;
2010-05-16 18:49:58 +04:00
struct btrfs_block_rsv * orphan_block_rsv ;
int orphan_cleanup_state ;
2008-11-18 04:42:26 +03:00
2009-06-10 18:45:14 +04:00
spinlock_t inode_lock ;
/* red-black tree that keeps track of in-memory inodes */
struct rb_root inode_tree ;
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does the delayed
items balance, reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comments.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason.
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as the inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so we made this patch, which implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- Introduce a delayed root object into the filesystem, which uses two lists to
manage the delayed nodes that are created for every file/directory.
One list holds all the delayed nodes that have delayed items. The
other holds the delayed nodes that are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees: one manages the directory name
index items that are going to be inserted into the b+ tree, and the other
manages the directory name index items that are going to be deleted from it.
- Introduce a worker to deal with the delayed operations, that is, the
delayed directory name index insertions and deletions and the delayed
inode updates.
When the number of delayed items exceeds the lower limit, we create works for
some delayed nodes, insert them into the work queue of the worker, and then
return.
When the number of delayed items exceeds the upper bound, we create works for
all the delayed nodes that haven't been dealt with, insert them into the work
queue of the worker, and then wait until the number of untreated items drops
below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add
the information to the delayed inserting rb-tree.
Then we check the number of delayed items and do the delayed items
balance. (The balance policy is described above.)
- When we want to delete a directory name index from the b+ tree, we search for
it in the inserting rb-tree first. If we find it there, we just drop it. If not,
we add its key to the delayed deleting rb-tree.
As with the delayed inserting rb-tree, we also check the number of
delayed items and do the delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertions and deletions.
- We move the delayed node to the tail of the list after we access it.
This way, we can cache more delayed items and merge more
inode updates.
- When we commit a transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 14:12:22 +04:00
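The balance policy in the commit message above has two thresholds: past the lower limit some delayed nodes are queued and the caller returns; past the upper bound all of them are queued and the caller waits. A minimal sketch of that decision is shown below. The threshold values and all `demo_`/`DEMO_` names are assumptions made for illustration, since the message does not give the real numbers.

```c
#include <assert.h>

/* Hypothetical thresholds; the commit message does not state the values. */
#define DEMO_DELAYED_LOWER_LIMIT	64
#define DEMO_DELAYED_UPPER_BOUND	128

enum demo_balance_action {
	DEMO_BALANCE_NONE,		/* below the lower limit: do nothing */
	DEMO_BALANCE_QUEUE_SOME,	/* queue some nodes, then return */
	DEMO_BALANCE_QUEUE_ALL_WAIT,	/* queue all nodes, then wait */
};

/* Pick the action for the current number of delayed items. */
static enum demo_balance_action demo_balance_action(int nr_delayed_items)
{
	if (nr_delayed_items > DEMO_DELAYED_UPPER_BOUND)
		return DEMO_BALANCE_QUEUE_ALL_WAIT;
	if (nr_delayed_items > DEMO_DELAYED_LOWER_LIMIT)
		return DEMO_BALANCE_QUEUE_SOME;
	return DEMO_BALANCE_NONE;
}
```

The same check would run after both delayed insertions and delayed deletions, as the message notes.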
	/*
	 * radix tree that keeps track of delayed nodes of every inode,
	 * protected by inode_lock
	 */
	struct radix_tree_root delayed_nodes_tree;
2008-11-18 04:42:26 +03:00
	/*
	 * right now this just gets used so that a root has its own devid
	 * for stat.  It may be used for more later
	 */
2011-07-07 23:44:25 +04:00
dev_t anon_dev ;
2011-11-15 05:48:06 +04:00
2012-12-07 13:28:54 +04:00
spinlock_t root_item_lock ;
2017-03-03 11:55:18 +03:00
refcount_t refs ;
2013-05-15 11:48:22 +04:00
2014-03-06 09:55:03 +04:00
struct mutex delalloc_mutex ;
2013-05-15 11:48:22 +04:00
spinlock_t delalloc_lock ;
	/*
	 * all of the inodes that have delalloc bytes.  It is possible for
	 * this list to be empty even when there is still dirty data=ordered
	 * extents waiting to finish IO.
	 */
	struct list_head delalloc_inodes;
	struct list_head delalloc_root;
	u64 nr_delalloc_inodes;
2014-03-06 09:55:02 +04:00
struct mutex ordered_extent_mutex ;
2013-05-15 11:48:23 +04:00
/*
* this is used by the balancing code to wait for all the pending
* ordered extents
*/
spinlock_t ordered_extent_lock ;
	/*
	 * all of the data=ordered extents pending writeback
	 * these can span multiple transactions and basically include
	 * every dirty data page that isn't from nodatacow
	 */
	struct list_head ordered_extents;
	struct list_head ordered_root;
	u64 nr_ordered_extents;
2013-12-16 20:34:17 +04:00
	/*
	 * Number of currently running SEND ioctls to prevent
	 * manipulation with the read-only status via SUBVOL_SETFLAGS
	 */
	int send_in_progress;
2014-03-06 09:38:19 +04:00
struct btrfs_subvolume_writers * subv_writers ;
2017-06-22 03:19:11 +03:00
atomic_t will_be_snapshotted ;
2015-09-08 12:08:38 +03:00
/* For qgroup metadata space reserve */
2017-03-14 13:25:09 +03:00
atomic64_t qgroup_meta_rsv ;
2007-03-15 19:56:47 +03:00
} ;
2017-05-22 13:16:11 +03:00
2017-07-24 22:14:25 +03:00
struct btrfs_file_private {
	struct btrfs_trans_handle *trans;
	void *filldir_buf;
};
2016-06-15 16:22:56 +03:00
static inline u32 btrfs_inode_sectorsize(const struct inode *inode)
{
	return btrfs_sb(inode->i_sb)->sectorsize;
}
2007-03-15 19:56:47 +03:00
2016-06-15 16:22:56 +03:00
static inline u32 BTRFS_LEAF_DATA_SIZE ( const struct btrfs_fs_info * info )
2016-06-15 17:33:06 +03:00
{
2017-05-22 13:16:11 +03:00
return info - > nodesize - sizeof ( struct btrfs_header ) ;
2016-06-15 17:33:06 +03:00
}
2017-05-29 09:43:43 +03:00
# define BTRFS_LEAF_DATA_OFFSET offsetof(struct btrfs_leaf, items)
2016-06-15 16:22:56 +03:00
static inline u32 BTRFS_MAX_ITEM_SIZE ( const struct btrfs_fs_info * info )
2016-06-15 17:33:06 +03:00
{
2016-06-15 16:22:56 +03:00
return BTRFS_LEAF_DATA_SIZE ( info ) - sizeof ( struct btrfs_item ) ;
2016-06-15 17:33:06 +03:00
}
2016-06-15 16:22:56 +03:00
static inline u32 BTRFS_NODEPTRS_PER_BLOCK ( const struct btrfs_fs_info * info )
2016-06-15 17:33:06 +03:00
{
2016-06-15 16:22:56 +03:00
return BTRFS_LEAF_DATA_SIZE ( info ) / sizeof ( struct btrfs_key_ptr ) ;
2016-06-15 17:33:06 +03:00
}
#define BTRFS_FILE_EXTENT_INLINE_DATA_START		\
		(offsetof(struct btrfs_file_extent_item, disk_bytenr))
2016-06-15 16:22:56 +03:00
static inline u32 BTRFS_MAX_INLINE_DATA_SIZE ( const struct btrfs_fs_info * info )
2016-06-15 17:33:06 +03:00
{
2016-06-15 16:22:56 +03:00
return BTRFS_MAX_ITEM_SIZE ( info ) -
2016-06-15 17:33:06 +03:00
BTRFS_FILE_EXTENT_INLINE_DATA_START ;
}
2016-06-15 16:22:56 +03:00
static inline u32 BTRFS_MAX_XATTR_SIZE ( const struct btrfs_fs_info * info )
2016-06-15 17:33:06 +03:00
{
2016-06-15 16:22:56 +03:00
return BTRFS_MAX_ITEM_SIZE ( info ) - sizeof ( struct btrfs_dir_item ) ;
2016-06-15 17:33:06 +03:00
}
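The helpers above derive every size limit from `nodesize`, so a worked example may help. The sketch below repeats the same arithmetic in userspace. It assumes the packed on-disk sizes of `struct btrfs_header` (101 bytes), `struct btrfs_item` (25 bytes) and `struct btrfs_key_ptr` (33 bytes); these constants and the `demo_` names are assumptions of the sketch, not definitions from this header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed packed on-disk sizes; stand-ins for the kernel's sizeof() calls. */
enum {
	DEMO_HEADER_SIZE  = 101,	/* sizeof(struct btrfs_header) */
	DEMO_ITEM_SIZE    = 25,		/* sizeof(struct btrfs_item) */
	DEMO_KEY_PTR_SIZE = 33,		/* sizeof(struct btrfs_key_ptr) */
};

/* Mirrors BTRFS_LEAF_DATA_SIZE(): the block minus its header. */
static uint32_t demo_leaf_data_size(uint32_t nodesize)
{
	return nodesize - DEMO_HEADER_SIZE;
}

/* Mirrors BTRFS_MAX_ITEM_SIZE(): leaf data minus one item header. */
static uint32_t demo_max_item_size(uint32_t nodesize)
{
	return demo_leaf_data_size(nodesize) - DEMO_ITEM_SIZE;
}

/* Mirrors BTRFS_NODEPTRS_PER_BLOCK(): key pointers that fit in a node. */
static uint32_t demo_nodeptrs_per_block(uint32_t nodesize)
{
	return demo_leaf_data_size(nodesize) / DEMO_KEY_PTR_SIZE;
}
```

With a 16 KiB nodesize, and under the size assumptions above, this gives 16283 bytes of leaf data, a 16258-byte maximum item and 493 key pointers per node.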
2011-06-28 19:10:37 +04:00
/*
 * Flags for mount options.
 *
 * Note: don't forget to add new options to btrfs_show_options()
 */
2008-01-09 17:23:21 +03:00
# define BTRFS_MOUNT_NODATASUM (1 << 0)
# define BTRFS_MOUNT_NODATACOW (1 << 1)
# define BTRFS_MOUNT_NOBARRIER (1 << 2)
2008-01-18 18:54:22 +03:00
# define BTRFS_MOUNT_SSD (1 << 3)
2008-05-13 21:46:40 +04:00
# define BTRFS_MOUNT_DEGRADED (1 << 4)
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 21:49:59 +03:00
# define BTRFS_MOUNT_COMPRESS (1 << 5)
2009-04-03 00:49:40 +04:00
# define BTRFS_MOUNT_NOTREELOG (1 << 6)
2009-04-03 00:59:01 +04:00
# define BTRFS_MOUNT_FLUSHONCOMMIT (1 << 7)
2009-06-10 04:28:34 +04:00
# define BTRFS_MOUNT_SSD_SPREAD (1 << 8)
2009-06-10 17:51:32 +04:00
# define BTRFS_MOUNT_NOSSD (1 << 9)
2009-10-14 17:24:59 +04:00
# define BTRFS_MOUNT_DISCARD (1 << 10)
2010-01-29 00:18:15 +03:00
# define BTRFS_MOUNT_FORCE_COMPRESS (1 << 11)
2010-06-21 22:48:16 +04:00
# define BTRFS_MOUNT_SPACE_CACHE (1 << 12)
2010-09-21 22:21:34 +04:00
# define BTRFS_MOUNT_CLEAR_CACHE (1 << 13)
2010-10-29 23:46:43 +04:00
# define BTRFS_MOUNT_USER_SUBVOL_RM_ALLOWED (1 << 14)
2011-02-16 21:10:41 +03:00
# define BTRFS_MOUNT_ENOSPC_DEBUG (1 << 15)
2011-05-24 23:35:30 +04:00
# define BTRFS_MOUNT_AUTO_DEFRAG (1 << 16)
2011-06-03 17:36:29 +04:00
# define BTRFS_MOUNT_INODE_MAP_CACHE (1 << 17)
2016-01-19 05:23:02 +03:00
# define BTRFS_MOUNT_USEBACKUPROOT (1 << 18)
2012-01-17 00:04:48 +04:00
# define BTRFS_MOUNT_SKIP_BALANCE (1 << 19)
2012-01-17 00:27:58 +04:00
# define BTRFS_MOUNT_CHECK_INTEGRITY (1 << 20)
# define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
2011-10-04 07:22:31 +04:00
# define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR (1 << 22)
2013-08-15 19:11:24 +04:00
# define BTRFS_MOUNT_RESCAN_UUID_TREE (1 << 23)
2015-09-23 21:54:14 +03:00
# define BTRFS_MOUNT_FRAGMENT_DATA (1 << 24)
# define BTRFS_MOUNT_FRAGMENT_METADATA (1 << 25)
2015-12-18 22:11:10 +03:00
# define BTRFS_MOUNT_FREE_SPACE_TREE (1 << 26)
2016-01-19 05:23:03 +03:00
# define BTRFS_MOUNT_NOLOGREPLAY (1 << 27)
2017-09-29 22:43:48 +03:00
# define BTRFS_MOUNT_REF_VERIFY (1 << 28)
2007-12-14 23:30:32 +03:00
2013-08-01 20:14:52 +04:00
# define BTRFS_DEFAULT_COMMIT_INTERVAL (30)
2015-10-08 15:14:16 +03:00
# define BTRFS_DEFAULT_MAX_INLINE (2048)
2013-08-01 20:14:52 +04:00
2007-12-14 23:30:32 +03:00
# define btrfs_clear_opt(o, opt) ((o) &= ~BTRFS_MOUNT_##opt)
# define btrfs_set_opt(o, opt) ((o) |= BTRFS_MOUNT_##opt)
2013-02-21 10:32:52 +04:00
# define btrfs_raw_test_opt(o, opt) ((o) & BTRFS_MOUNT_##opt)
2016-06-10 04:38:35 +03:00
# define btrfs_test_opt(fs_info, opt) ((fs_info)->mount_opt & \
2007-12-14 23:30:32 +03:00
BTRFS_MOUNT_ # # opt )
2014-02-05 18:26:17 +04:00
2016-06-10 04:38:35 +03:00
# define btrfs_set_and_info(fs_info, opt, fmt, args...) \
2014-04-23 15:33:33 +04:00
{ \
2016-06-10 04:38:35 +03:00
if ( ! btrfs_test_opt ( fs_info , opt ) ) \
btrfs_info ( fs_info , fmt , # # args ) ; \
btrfs_set_opt ( fs_info - > mount_opt , opt ) ; \
2014-04-23 15:33:33 +04:00
}
2016-06-10 04:38:35 +03:00
# define btrfs_clear_and_info(fs_info, opt, fmt, args...) \
2014-04-23 15:33:33 +04:00
{ \
2016-06-10 04:38:35 +03:00
if ( btrfs_test_opt ( fs_info , opt ) ) \
btrfs_info ( fs_info , fmt , # # args ) ; \
btrfs_clear_opt ( fs_info - > mount_opt , opt ) ; \
2014-04-23 15:33:33 +04:00
}
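The option macros above all rely on token pasting: the short option name is glued onto the `BTRFS_MOUNT_` prefix to pick the bit. A userspace sketch of the same pattern follows. The `DEMO_`/`demo_` names are hypothetical stand-ins, and only the two bit values shown are copied from the flag list above.

```c
#include <assert.h>

/* Two flags with the same bit positions as their BTRFS_MOUNT_* namesakes. */
#define DEMO_MOUNT_SSD		(1 << 3)
#define DEMO_MOUNT_COMPRESS	(1 << 5)

/* `opt` is pasted onto the prefix, so callers write just SSD or COMPRESS. */
#define demo_set_opt(o, opt)	((o) |= DEMO_MOUNT_##opt)
#define demo_clear_opt(o, opt)	((o) &= ~DEMO_MOUNT_##opt)
#define demo_test_opt(o, opt)	((o) & DEMO_MOUNT_##opt)

/* Set both options, clear one again, and return the resulting mask. */
static unsigned long demo_mount_opts(void)
{
	unsigned long opts = 0;

	demo_set_opt(opts, SSD);
	demo_set_opt(opts, COMPRESS);
	demo_clear_opt(opts, SSD);
	return opts;
}
```

Because the paste happens at preprocessing time, a misspelled option name fails to compile instead of silently testing the wrong bit.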
2015-09-23 21:54:14 +03:00
# ifdef CONFIG_BTRFS_DEBUG
static inline int
2016-06-23 01:54:24 +03:00
btrfs_should_fragment_free_space ( struct btrfs_block_group_cache * block_group )
2015-09-23 21:54:14 +03:00
{
2016-06-23 01:54:24 +03:00
struct btrfs_fs_info * fs_info = block_group - > fs_info ;
2016-06-23 01:54:23 +03:00
return ( btrfs_test_opt ( fs_info , FRAGMENT_METADATA ) & &
2015-09-23 21:54:14 +03:00
block_group - > flags & BTRFS_BLOCK_GROUP_METADATA ) | |
2016-06-23 01:54:23 +03:00
( btrfs_test_opt ( fs_info , FRAGMENT_DATA ) & &
2015-09-23 21:54:14 +03:00
block_group - > flags & BTRFS_BLOCK_GROUP_DATA ) ;
}
# endif
2014-02-05 18:26:17 +04:00
/*
 * Requests for changes that need to be done during transaction commit.
 *
 * Internal mount options that are used for special handling of the real
 * mount options (eg. cannot be set during remount and have to be set during
 * transaction commit)
 */
2014-02-05 18:26:17 +04:00
# define BTRFS_PENDING_SET_INODE_MAP_CACHE (0)
# define BTRFS_PENDING_CLEAR_INODE_MAP_CACHE (1)
2014-11-12 16:24:35 +03:00
# define BTRFS_PENDING_COMMIT (2)
2014-02-05 18:26:17 +04:00
2014-02-05 18:26:17 +04:00
#define btrfs_test_pending(info, opt)	\
	test_bit(BTRFS_PENDING_##opt, &(info)->pending_changes)
#define btrfs_set_pending(info, opt)	\
	set_bit(BTRFS_PENDING_##opt, &(info)->pending_changes)
#define btrfs_clear_pending(info, opt)	\
	clear_bit(BTRFS_PENDING_##opt, &(info)->pending_changes)

/*
 * Helpers for setting pending mount option changes.
 *
 * Expects corresponding macros
 * BTRFS_PENDING_SET_ and CLEAR_ + short mount option name
 */
#define btrfs_set_pending_and_info(info, opt, fmt, args...)		\
do {									\
	if (!btrfs_raw_test_opt((info)->mount_opt, opt)) {		\
		btrfs_info((info), fmt, ##args);			\
		btrfs_set_pending((info), SET_##opt);			\
		btrfs_clear_pending((info), CLEAR_##opt);		\
	}								\
} while (0)

#define btrfs_clear_pending_and_info(info, opt, fmt, args...)		\
do {									\
	if (btrfs_raw_test_opt((info)->mount_opt, opt)) {		\
		btrfs_info((info), fmt, ##args);			\
		btrfs_set_pending((info), CLEAR_##opt);			\
		btrfs_clear_pending((info), SET_##opt);			\
	}								\
} while (0)
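The two `_and_info` helpers above keep the SET_ and CLEAR_ requests for an option mutually exclusive, and they only queue a request when it would change the current mount option. A simplified userspace sketch of that behaviour follows. It uses plain, non-atomic bit operations in place of the kernel's `set_bit()`/`clear_bit()`/`test_bit()`, and all `demo_`/`DEMO_` names are hypothetical.

```c
#include <assert.h>

/* Bit numbers mirror BTRFS_PENDING_SET/CLEAR_INODE_MAP_CACHE above. */
enum {
	DEMO_PENDING_SET_INODE_MAP_CACHE   = 0,
	DEMO_PENDING_CLEAR_INODE_MAP_CACHE = 1,
};

/* Non-atomic stand-ins for the kernel bit helpers. */
static void demo_set_bit(int nr, unsigned long *addr)
{
	*addr |= 1UL << nr;
}

static void demo_clear_bit(int nr, unsigned long *addr)
{
	*addr &= ~(1UL << nr);
}

static int demo_test_bit(int nr, const unsigned long *addr)
{
	return (int)((*addr >> nr) & 1UL);
}

/* Queue "enable at commit", but only if the option is currently off. */
static void demo_request_enable(int opt_enabled, unsigned long *pending)
{
	if (!opt_enabled) {
		demo_set_bit(DEMO_PENDING_SET_INODE_MAP_CACHE, pending);
		demo_clear_bit(DEMO_PENDING_CLEAR_INODE_MAP_CACHE, pending);
	}
}

/* Queue "disable at commit", but only if the option is currently on. */
static void demo_request_disable(int opt_enabled, unsigned long *pending)
{
	if (opt_enabled) {
		demo_set_bit(DEMO_PENDING_CLEAR_INODE_MAP_CACHE, pending);
		demo_clear_bit(DEMO_PENDING_SET_INODE_MAP_CACHE, pending);
	}
}
```

Whichever request came last wins, so the transaction commit never sees both bits set for the same option.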
2008-01-08 23:54:37 +03:00
/*
* Inode flags
*/
2008-01-14 21:26:08 +03:00
# define BTRFS_INODE_NODATASUM (1 << 0)
# define BTRFS_INODE_NODATACOW (1 << 1)
# define BTRFS_INODE_READONLY (1 << 2)
2008-10-29 21:49:59 +03:00
# define BTRFS_INODE_NOCOMPRESS (1 << 3)
2008-10-30 21:25:28 +03:00
# define BTRFS_INODE_PREALLOC (1 << 4)
2009-04-17 12:37:41 +04:00
# define BTRFS_INODE_SYNC (1 << 5)
# define BTRFS_INODE_IMMUTABLE (1 << 6)
# define BTRFS_INODE_APPEND (1 << 7)
# define BTRFS_INODE_NODUMP (1 << 8)
# define BTRFS_INODE_NOATIME (1 << 9)
# define BTRFS_INODE_DIRSYNC (1 << 10)
Btrfs: Per file/directory controls for COW and compression
Data compression and data cow are controlled across the entire FS by mount
options right now. ioctls are needed to set this on a per file or per
directory basis. This has been proposed previously, but VFS developers
wanted us to use generic ioctls rather than btrfs-specific ones.
According to Chris's comment, there should be just one true compression
method (probably LZO) stored in the super. However, before doing this, we
should wait until that one method is stable enough to be adopted into the super.
So I list it as a long term goal, and just store it in RAM today.
After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to
control the datacow and compression attributes of files and directories.
NOTE:
- The compression type is selected by these rules:
If btrfs is mounted with a compress option, i.e. zlib/lzo, that type is used.
Otherwise, we'll use the default compress type (zlib today).
v1->v2:
- rebase to the latest btrfs.
v2->v3:
- fix a problem: when a file is set NOCOW via a mount option, this NOCOW
would be overridden by inheritance from the parent directory.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-22 13:12:20 +03:00
# define BTRFS_INODE_COMPRESS (1 << 11)
2009-04-17 12:37:41 +04:00
2011-03-28 06:01:25 +04:00
# define BTRFS_INODE_ROOT_ITEM_INIT (1 << 31)
2012-03-03 16:40:03 +04:00
struct btrfs_map_token {
2017-06-29 06:56:53 +03:00
const struct extent_buffer * eb ;
2012-03-03 16:40:03 +04:00
char * kaddr ;
unsigned long offset ;
} ;
2016-01-21 13:25:53 +03:00
#define BTRFS_BYTES_TO_BLKS(fs_info, bytes) \
	((bytes) >> (fs_info)->sb->s_blocksize_bits)
2012-03-03 16:40:03 +04:00
static inline void btrfs_init_map_token ( struct btrfs_map_token * token )
{
2012-10-15 21:39:33 +04:00
token - > kaddr = NULL ;
2012-03-03 16:40:03 +04:00
}
2016-05-20 04:18:45 +03:00
/* some macros to generate set/get functions for the struct fields. This
2007-10-16 00:14:19 +04:00
 * assumes there is a lefoo_to_cpu for every type, so let's make a simple
 * one for u8:
 */
# define le8_to_cpu(v) (v)
# define cpu_to_le8(v) (v)
# define __le8 u8
2016-09-20 17:05:01 +03:00
# define read_eb_member(eb, ptr, type, member, result) (\
2007-10-16 00:14:19 +04:00
	read_extent_buffer(eb, (char *)(result),			\
			   ((unsigned long)(ptr)) +			\
			    offsetof(type, member),			\
			   sizeof(((type *)0)->member)))
2016-09-20 17:05:01 +03:00
# define write_eb_member(eb, ptr, type, member, result) (\
2007-10-16 00:14:19 +04:00
	write_extent_buffer(eb, (char *)(result),			\
			   ((unsigned long)(ptr)) +			\
			    offsetof(type, member),			\
			   sizeof(((type *)0)->member)))
2012-07-10 06:22:35 +04:00
# define DECLARE_BTRFS_SETGET_BITS(bits) \
2017-06-29 06:56:53 +03:00
u # # bits btrfs_get_token_ # # bits ( const struct extent_buffer * eb , \
const void * ptr , unsigned long off , \
struct btrfs_map_token * token ) ; \
void btrfs_set_token_ # # bits ( struct extent_buffer * eb , const void * ptr , \
2012-07-10 06:22:35 +04:00
unsigned long off , u # # bits val , \
struct btrfs_map_token * token ) ; \
2017-06-29 06:56:53 +03:00
static inline u # # bits btrfs_get_ # # bits ( const struct extent_buffer * eb , \
const void * ptr , \
2012-07-10 06:22:35 +04:00
unsigned long off ) \
{ \
return btrfs_get_token_ # # bits ( eb , ptr , off , NULL ) ; \
} \
2017-06-29 06:56:53 +03:00
static inline void btrfs_set_ # # bits ( struct extent_buffer * eb , void * ptr , \
2012-07-10 06:22:35 +04:00
unsigned long off , u # # bits val ) \
{ \
btrfs_set_token_ # # bits ( eb , ptr , off , val , NULL ) ; \
}
DECLARE_BTRFS_SETGET_BITS(8)
DECLARE_BTRFS_SETGET_BITS(16)
DECLARE_BTRFS_SETGET_BITS(32)
DECLARE_BTRFS_SETGET_BITS(64)
2007-10-16 00:14:19 +04:00
# define BTRFS_SETGET_FUNCS(name, type, member, bits) \
2017-06-29 06:56:53 +03:00
static inline u # # bits btrfs_ # # name ( const struct extent_buffer * eb , \
const type * s ) \
2012-07-10 06:22:35 +04:00
{ \
BUILD_BUG_ON ( sizeof ( u # # bits ) ! = sizeof ( ( ( type * ) 0 ) ) - > member ) ; \
return btrfs_get_ # # bits ( eb , s , offsetof ( type , member ) ) ; \
} \
static inline void btrfs_set_ # # name ( struct extent_buffer * eb , type * s , \
u # # bits val ) \
{ \
BUILD_BUG_ON ( sizeof ( u # # bits ) ! = sizeof ( ( ( type * ) 0 ) ) - > member ) ; \
btrfs_set_ # # bits ( eb , s , offsetof ( type , member ) , val ) ; \
} \
2017-06-29 06:56:53 +03:00
static inline u # # bits btrfs_token_ # # name ( const struct extent_buffer * eb , \
const type * s , \
2012-07-10 06:22:35 +04:00
struct btrfs_map_token * token ) \
{ \
BUILD_BUG_ON ( sizeof ( u # # bits ) ! = sizeof ( ( ( type * ) 0 ) ) - > member ) ; \
return btrfs_get_token_ # # bits ( eb , s , offsetof ( type , member ) , token ) ; \
} \
static inline void btrfs_set_token_ # # name ( struct extent_buffer * eb , \
type * s , u # # bits val , \
struct btrfs_map_token * token ) \
{ \
BUILD_BUG_ON ( sizeof ( u # # bits ) ! = sizeof ( ( ( type * ) 0 ) ) - > member ) ; \
btrfs_set_token_ # # bits ( eb , s , offsetof ( type , member ) , val , token ) ; \
}
2007-10-16 00:14:19 +04:00
# define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits) \
2017-06-29 06:56:53 +03:00
static inline u # # bits btrfs_ # # name ( const struct extent_buffer * eb ) \
2007-10-16 00:14:19 +04:00
{ \
2017-06-29 06:56:53 +03:00
const type * p = page_address ( eb - > pages [ 0 ] ) ; \
2008-02-15 18:40:52 +03:00
u # # bits res = le # # bits # # _to_cpu ( p - > member ) ; \
2007-10-16 00:18:55 +04:00
return res ; \
2007-10-16 00:14:19 +04:00
} \
static inline void btrfs_set_ # # name ( struct extent_buffer * eb , \
u # # bits val ) \
{ \
2010-08-06 21:21:20 +04:00
type * p = page_address ( eb - > pages [ 0 ] ) ; \
2008-02-15 18:40:52 +03:00
p - > member = cpu_to_le # # bits ( val ) ; \
2007-10-16 00:14:19 +04:00
}
2007-04-27 00:46:15 +04:00
2007-10-16 00:14:19 +04:00
# define BTRFS_SETGET_STACK_FUNCS(name, type, member, bits) \
2017-06-29 06:56:53 +03:00
static inline u # # bits btrfs_ # # name ( const type * s ) \
2007-10-16 00:14:19 +04:00
{ \
return le # # bits # # _to_cpu ( s - > member ) ; \
} \
static inline void btrfs_set_ # # name ( type * s , u # # bits val ) \
{ \
s - > member = cpu_to_le # # bits ( val ) ; \
2007-03-16 02:03:33 +03:00
}
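/*
 * Usage sketch for the _STACK_ variants (illustrative; "dk" and "id" are
 * hypothetical locals). These accessors take a plain pointer to an
 * in-memory copy of the on-disk structure and only perform the
 * little-endian conversion, so no extent_buffer is involved:
 *
 *	struct btrfs_disk_key dk;
 *	u64 id;
 *
 *	btrfs_set_disk_key_objectid(&dk, 256);
 *	id = btrfs_disk_key_objectid(&dk);
 *
 * After the two calls, dk.objectid holds 256 in little-endian form and
 * id holds 256 in CPU byte order.
 */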
2017-06-16 14:39:19 +03:00
static inline u64 btrfs_device_total_bytes ( struct extent_buffer * eb ,
struct btrfs_dev_item * s )
{
BUILD_BUG_ON ( sizeof ( u64 ) ! =
sizeof ( ( ( struct btrfs_dev_item * ) 0 ) ) - > total_bytes ) ;
return btrfs_get_64 ( eb , s , offsetof ( struct btrfs_dev_item ,
total_bytes ) ) ;
}
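/*
 * Open-coded instead of using BTRFS_SETGET_FUNCS so that the setter can
 * warn when the new size is not aligned to the filesystem sector size.
 */
static inline void btrfs_set_device_total_bytes ( struct extent_buffer * eb ,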
struct btrfs_dev_item * s ,
u64 val )
{
BUILD_BUG_ON ( sizeof ( u64 ) ! =
sizeof ( ( ( struct btrfs_dev_item * ) 0 ) ) - > total_bytes ) ;
2017-06-16 14:39:20 +03:00
WARN_ON ( ! IS_ALIGNED ( val , eb - > fs_info - > sectorsize ) ) ;
2017-06-16 14:39:19 +03:00
btrfs_set_64 ( eb , s , offsetof ( struct btrfs_dev_item , total_bytes ) , val ) ;
}
2008-03-24 22:01:56 +03:00
BTRFS_SETGET_FUNCS ( device_type , struct btrfs_dev_item , type , 64 ) ;
BTRFS_SETGET_FUNCS ( device_bytes_used , struct btrfs_dev_item , bytes_used , 64 ) ;
BTRFS_SETGET_FUNCS ( device_io_align , struct btrfs_dev_item , io_align , 32 ) ;
BTRFS_SETGET_FUNCS ( device_io_width , struct btrfs_dev_item , io_width , 32 ) ;
2008-12-09 00:40:21 +03:00
BTRFS_SETGET_FUNCS ( device_start_offset , struct btrfs_dev_item ,
start_offset , 64 ) ;
2008-03-24 22:01:56 +03:00
BTRFS_SETGET_FUNCS ( device_sector_size , struct btrfs_dev_item , sector_size , 32 ) ;
BTRFS_SETGET_FUNCS ( device_id , struct btrfs_dev_item , devid , 64 ) ;
2008-04-15 23:41:47 +04:00
BTRFS_SETGET_FUNCS ( device_group , struct btrfs_dev_item , dev_group , 32 ) ;
BTRFS_SETGET_FUNCS ( device_seek_speed , struct btrfs_dev_item , seek_speed , 8 ) ;
BTRFS_SETGET_FUNCS ( device_bandwidth , struct btrfs_dev_item , bandwidth , 8 ) ;
2008-11-18 05:11:30 +03:00
BTRFS_SETGET_FUNCS ( device_generation , struct btrfs_dev_item , generation , 64 ) ;
2008-03-24 22:01:56 +03:00
2008-03-24 22:02:07 +03:00
BTRFS_SETGET_STACK_FUNCS ( stack_device_type , struct btrfs_dev_item , type , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_device_total_bytes , struct btrfs_dev_item ,
total_bytes , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_device_bytes_used , struct btrfs_dev_item ,
bytes_used , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_device_io_align , struct btrfs_dev_item ,
io_align , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_device_io_width , struct btrfs_dev_item ,
io_width , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_device_sector_size , struct btrfs_dev_item ,
sector_size , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_device_id , struct btrfs_dev_item , devid , 64 ) ;
2008-04-15 23:41:47 +04:00
BTRFS_SETGET_STACK_FUNCS ( stack_device_group , struct btrfs_dev_item ,
dev_group , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_device_seek_speed , struct btrfs_dev_item ,
seek_speed , 8 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_device_bandwidth , struct btrfs_dev_item ,
bandwidth , 8 ) ;
2008-11-18 05:11:30 +03:00
BTRFS_SETGET_STACK_FUNCS ( stack_device_generation , struct btrfs_dev_item ,
generation , 64 ) ;
2008-03-24 22:02:07 +03:00
2013-08-20 15:20:11 +04:00
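/*
 * Return the position of the uuid field as an unsigned long. The value is
 * meant to be passed to the extent buffer read/write helpers, not to be
 * dereferenced directly.
 */
static inline unsigned long btrfs_device_uuid ( struct btrfs_dev_item * d )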
2008-03-24 22:01:56 +03:00
{
2013-08-20 15:20:11 +04:00
return ( unsigned long ) d + offsetof ( struct btrfs_dev_item , uuid ) ;
2008-03-24 22:01:56 +03:00
}
2013-08-20 15:20:12 +04:00
static inline unsigned long btrfs_device_fsid ( struct btrfs_dev_item * d )
2008-11-18 05:11:30 +03:00
{
2013-08-20 15:20:12 +04:00
return ( unsigned long ) d + offsetof ( struct btrfs_dev_item , fsid ) ;
2008-11-18 05:11:30 +03:00
}
2008-04-15 23:41:47 +04:00
BTRFS_SETGET_FUNCS ( chunk_length , struct btrfs_chunk , length , 64 ) ;
2008-03-24 22:01:56 +03:00
BTRFS_SETGET_FUNCS ( chunk_owner , struct btrfs_chunk , owner , 64 ) ;
BTRFS_SETGET_FUNCS ( chunk_stripe_len , struct btrfs_chunk , stripe_len , 64 ) ;
BTRFS_SETGET_FUNCS ( chunk_io_align , struct btrfs_chunk , io_align , 32 ) ;
BTRFS_SETGET_FUNCS ( chunk_io_width , struct btrfs_chunk , io_width , 32 ) ;
BTRFS_SETGET_FUNCS ( chunk_sector_size , struct btrfs_chunk , sector_size , 32 ) ;
BTRFS_SETGET_FUNCS ( chunk_type , struct btrfs_chunk , type , 64 ) ;
BTRFS_SETGET_FUNCS ( chunk_num_stripes , struct btrfs_chunk , num_stripes , 16 ) ;
2008-04-16 18:49:51 +04:00
BTRFS_SETGET_FUNCS ( chunk_sub_stripes , struct btrfs_chunk , sub_stripes , 16 ) ;
2008-03-24 22:01:56 +03:00
BTRFS_SETGET_FUNCS ( stripe_devid , struct btrfs_stripe , devid , 64 ) ;
BTRFS_SETGET_FUNCS ( stripe_offset , struct btrfs_stripe , offset , 64 ) ;
2008-04-15 23:41:47 +04:00
static inline char * btrfs_stripe_dev_uuid ( struct btrfs_stripe * s )
{
return ( char * ) s + offsetof ( struct btrfs_stripe , dev_uuid ) ;
}
BTRFS_SETGET_STACK_FUNCS ( stack_chunk_length , struct btrfs_chunk , length , 64 ) ;
2008-03-24 22:01:56 +03:00
BTRFS_SETGET_STACK_FUNCS ( stack_chunk_owner , struct btrfs_chunk , owner , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_chunk_stripe_len , struct btrfs_chunk ,
stripe_len , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_chunk_io_align , struct btrfs_chunk ,
io_align , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_chunk_io_width , struct btrfs_chunk ,
io_width , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_chunk_sector_size , struct btrfs_chunk ,
sector_size , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_chunk_type , struct btrfs_chunk , type , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_chunk_num_stripes , struct btrfs_chunk ,
num_stripes , 16 ) ;
2008-04-16 18:49:51 +04:00
BTRFS_SETGET_STACK_FUNCS ( stack_chunk_sub_stripes , struct btrfs_chunk ,
sub_stripes , 16 ) ;
2008-03-24 22:01:56 +03:00
BTRFS_SETGET_STACK_FUNCS ( stack_stripe_devid , struct btrfs_stripe , devid , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_stripe_offset , struct btrfs_stripe , offset , 64 ) ;
static inline struct btrfs_stripe * btrfs_stripe_nr ( struct btrfs_chunk * c ,
int nr )
{
unsigned long offset = ( unsigned long ) c ;
offset + = offsetof ( struct btrfs_chunk , stripe ) ;
offset + = nr * sizeof ( struct btrfs_stripe ) ;
return ( struct btrfs_stripe * ) offset ;
}
2008-04-18 18:29:38 +04:00
static inline char * btrfs_stripe_dev_uuid_nr ( struct btrfs_chunk * c , int nr )
{
return btrfs_stripe_dev_uuid ( btrfs_stripe_nr ( c , nr ) ) ;
}
2008-03-24 22:01:56 +03:00
static inline u64 btrfs_stripe_offset_nr ( struct extent_buffer * eb ,
struct btrfs_chunk * c , int nr )
{
return btrfs_stripe_offset ( eb , btrfs_stripe_nr ( c , nr ) ) ;
}
static inline u64 btrfs_stripe_devid_nr ( struct extent_buffer * eb ,
struct btrfs_chunk * c , int nr )
{
return btrfs_stripe_devid ( eb , btrfs_stripe_nr ( c , nr ) ) ;
}
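/*
 * Usage sketch (illustrative; "eb", "chunk", "i", "num", "devid" and
 * "off" are hypothetical locals, with "chunk" assumed to refer to a
 * struct btrfs_chunk item inside "eb"). The _nr helpers combine
 * btrfs_stripe_nr() with the per-stripe getters, so every stripe of a
 * chunk can be visited like this:
 *
 *	int i;
 *	int num = btrfs_chunk_num_stripes(eb, chunk);
 *
 *	for (i = 0; i < num; i++) {
 *		u64 devid = btrfs_stripe_devid_nr(eb, chunk, i);
 *		u64 off = btrfs_stripe_offset_nr(eb, chunk, i);
 *		...
 *	}
 */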
2007-10-16 00:14:19 +04:00
/* struct btrfs_block_group_item */
BTRFS_SETGET_STACK_FUNCS ( block_group_used , struct btrfs_block_group_item ,
used , 64 ) ;
BTRFS_SETGET_FUNCS ( disk_block_group_used , struct btrfs_block_group_item ,
used , 64 ) ;
2008-03-24 22:01:56 +03:00
BTRFS_SETGET_STACK_FUNCS ( block_group_chunk_objectid ,
struct btrfs_block_group_item , chunk_objectid , 64 ) ;
2008-04-15 23:41:47 +04:00
BTRFS_SETGET_FUNCS ( disk_block_group_chunk_objectid ,
2008-03-24 22:01:56 +03:00
struct btrfs_block_group_item , chunk_objectid , 64 ) ;
BTRFS_SETGET_FUNCS ( disk_block_group_flags ,
struct btrfs_block_group_item , flags , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( block_group_flags ,
struct btrfs_block_group_item , flags , 64 ) ;
2007-03-16 02:03:33 +03:00
2015-09-30 06:50:34 +03:00
/* struct btrfs_free_space_info */
BTRFS_SETGET_FUNCS ( free_space_extent_count , struct btrfs_free_space_info ,
extent_count , 32 ) ;
BTRFS_SETGET_FUNCS ( free_space_flags , struct btrfs_free_space_info , flags , 32 ) ;
2007-12-12 22:38:19 +03:00
/* struct btrfs_inode_ref */
BTRFS_SETGET_FUNCS ( inode_ref_name_len , struct btrfs_inode_ref , name_len , 16 ) ;
2008-07-24 20:12:38 +04:00
BTRFS_SETGET_FUNCS ( inode_ref_index , struct btrfs_inode_ref , index , 64 ) ;
2007-12-12 22:38:19 +03:00
2012-08-08 22:32:27 +04:00
/* struct btrfs_inode_extref */
BTRFS_SETGET_FUNCS ( inode_extref_parent , struct btrfs_inode_extref ,
parent_objectid , 64 ) ;
BTRFS_SETGET_FUNCS ( inode_extref_name_len , struct btrfs_inode_extref ,
name_len , 16 ) ;
BTRFS_SETGET_FUNCS ( inode_extref_index , struct btrfs_inode_extref , index , 64 ) ;
2007-10-16 00:14:19 +04:00
/* struct btrfs_inode_item */
BTRFS_SETGET_FUNCS ( inode_generation , struct btrfs_inode_item , generation , 64 ) ;
2008-12-09 00:40:21 +03:00
BTRFS_SETGET_FUNCS ( inode_sequence , struct btrfs_inode_item , sequence , 64 ) ;
2008-09-06 00:13:11 +04:00
BTRFS_SETGET_FUNCS ( inode_transid , struct btrfs_inode_item , transid , 64 ) ;
2007-10-16 00:14:19 +04:00
BTRFS_SETGET_FUNCS ( inode_size , struct btrfs_inode_item , size , 64 ) ;
2008-10-09 19:46:29 +04:00
BTRFS_SETGET_FUNCS ( inode_nbytes , struct btrfs_inode_item , nbytes , 64 ) ;
2007-10-16 00:14:19 +04:00
BTRFS_SETGET_FUNCS ( inode_block_group , struct btrfs_inode_item , block_group , 64 ) ;
BTRFS_SETGET_FUNCS ( inode_nlink , struct btrfs_inode_item , nlink , 32 ) ;
BTRFS_SETGET_FUNCS ( inode_uid , struct btrfs_inode_item , uid , 32 ) ;
BTRFS_SETGET_FUNCS ( inode_gid , struct btrfs_inode_item , gid , 32 ) ;
BTRFS_SETGET_FUNCS ( inode_mode , struct btrfs_inode_item , mode , 32 ) ;
2008-03-24 22:01:56 +03:00
BTRFS_SETGET_FUNCS ( inode_rdev , struct btrfs_inode_item , rdev , 64 ) ;
2008-12-02 14:36:08 +03:00
BTRFS_SETGET_FUNCS ( inode_flags , struct btrfs_inode_item , flags , 64 ) ;
2013-07-16 07:19:18 +04:00
BTRFS_SETGET_STACK_FUNCS ( stack_inode_generation , struct btrfs_inode_item ,
generation , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_sequence , struct btrfs_inode_item ,
sequence , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_transid , struct btrfs_inode_item ,
transid , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_size , struct btrfs_inode_item , size , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_nbytes , struct btrfs_inode_item ,
nbytes , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_block_group , struct btrfs_inode_item ,
block_group , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_nlink , struct btrfs_inode_item , nlink , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_uid , struct btrfs_inode_item , uid , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_gid , struct btrfs_inode_item , gid , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_mode , struct btrfs_inode_item , mode , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_rdev , struct btrfs_inode_item , rdev , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_flags , struct btrfs_inode_item , flags , 64 ) ;
2008-03-24 22:01:56 +03:00
BTRFS_SETGET_FUNCS ( timespec_sec , struct btrfs_timespec , sec , 64 ) ;
BTRFS_SETGET_FUNCS ( timespec_nsec , struct btrfs_timespec , nsec , 32 ) ;
2013-07-16 07:19:18 +04:00
BTRFS_SETGET_STACK_FUNCS ( stack_timespec_sec , struct btrfs_timespec , sec , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_timespec_nsec , struct btrfs_timespec , nsec , 32 ) ;
2007-03-22 19:13:20 +03:00
2008-03-24 22:01:56 +03:00
/* struct btrfs_dev_extent */
2008-04-15 23:41:47 +04:00
BTRFS_SETGET_FUNCS ( dev_extent_chunk_tree , struct btrfs_dev_extent ,
chunk_tree , 64 ) ;
BTRFS_SETGET_FUNCS ( dev_extent_chunk_objectid , struct btrfs_dev_extent ,
chunk_objectid , 64 ) ;
BTRFS_SETGET_FUNCS ( dev_extent_chunk_offset , struct btrfs_dev_extent ,
chunk_offset , 64 ) ;
2008-03-24 22:01:56 +03:00
BTRFS_SETGET_FUNCS ( dev_extent_length , struct btrfs_dev_extent , length , 64 ) ;
2013-08-20 15:20:13 +04:00
static inline unsigned long btrfs_dev_extent_chunk_tree_uuid ( struct btrfs_dev_extent * dev )
2008-04-15 23:41:47 +04:00
{
unsigned long ptr = offsetof ( struct btrfs_dev_extent , chunk_tree_uuid ) ;
2013-08-20 15:20:13 +04:00
return ( unsigned long ) dev + ptr ;
2008-04-15 23:41:47 +04:00
}
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 18:45:14 +04:00
BTRFS_SETGET_FUNCS ( extent_refs , struct btrfs_extent_item , refs , 64 ) ;
BTRFS_SETGET_FUNCS ( extent_generation , struct btrfs_extent_item ,
generation , 64 ) ;
BTRFS_SETGET_FUNCS ( extent_flags , struct btrfs_extent_item , flags , 64 ) ;
2007-12-11 17:25:06 +03:00
2009-06-10 18:45:14 +04:00
BTRFS_SETGET_FUNCS ( extent_refs_v0 , struct btrfs_extent_item_v0 , refs , 32 ) ;
BTRFS_SETGET_FUNCS ( tree_block_level , struct btrfs_tree_block_info , level , 8 ) ;
static inline void btrfs_tree_block_key ( struct extent_buffer * eb ,
struct btrfs_tree_block_info * item ,
struct btrfs_disk_key * key )
{
read_eb_member ( eb , item , struct btrfs_tree_block_info , key , key ) ;
}
static inline void btrfs_set_tree_block_key ( struct extent_buffer * eb ,
struct btrfs_tree_block_info * item ,
struct btrfs_disk_key * key )
{
write_eb_member ( eb , item , struct btrfs_tree_block_info , key , key ) ;
}
2007-03-22 19:13:20 +03:00
2009-06-10 18:45:14 +04:00
BTRFS_SETGET_FUNCS ( extent_data_ref_root , struct btrfs_extent_data_ref ,
root , 64 ) ;
BTRFS_SETGET_FUNCS ( extent_data_ref_objectid , struct btrfs_extent_data_ref ,
objectid , 64 ) ;
BTRFS_SETGET_FUNCS ( extent_data_ref_offset , struct btrfs_extent_data_ref ,
offset , 64 ) ;
BTRFS_SETGET_FUNCS ( extent_data_ref_count , struct btrfs_extent_data_ref ,
count , 32 ) ;
BTRFS_SETGET_FUNCS ( shared_data_ref_count , struct btrfs_shared_data_ref ,
count , 32 ) ;
BTRFS_SETGET_FUNCS ( extent_inline_ref_type , struct btrfs_extent_inline_ref ,
type , 8 ) ;
BTRFS_SETGET_FUNCS ( extent_inline_ref_offset , struct btrfs_extent_inline_ref ,
offset , 64 ) ;
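/*
 * Size in bytes of an inline back reference of the given key type, or 0
 * for an unknown type. A shared data ref follows the inline ref header,
 * hence the sum of both sizes. An extent data ref starts at the offset
 * field of the inline ref, hence offsetof() instead of sizeof() there.
 */
static inline u32 btrfs_extent_inline_ref_size ( int type )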
{
if ( type = = BTRFS_TREE_BLOCK_REF_KEY | |
type = = BTRFS_SHARED_BLOCK_REF_KEY )
return sizeof ( struct btrfs_extent_inline_ref ) ;
if ( type = = BTRFS_SHARED_DATA_REF_KEY )
return sizeof ( struct btrfs_shared_data_ref ) +
sizeof ( struct btrfs_extent_inline_ref ) ;
if ( type = = BTRFS_EXTENT_DATA_REF_KEY )
return sizeof ( struct btrfs_extent_data_ref ) +
offsetof ( struct btrfs_extent_inline_ref , offset ) ;
return 0 ;
}
BTRFS_SETGET_FUNCS ( ref_root_v0 , struct btrfs_extent_ref_v0 , root , 64 ) ;
BTRFS_SETGET_FUNCS ( ref_generation_v0 , struct btrfs_extent_ref_v0 ,
generation , 64 ) ;
BTRFS_SETGET_FUNCS ( ref_objectid_v0 , struct btrfs_extent_ref_v0 , objectid , 64 ) ;
BTRFS_SETGET_FUNCS ( ref_count_v0 , struct btrfs_extent_ref_v0 , count , 32 ) ;
2007-03-22 19:13:20 +03:00
2007-10-16 00:14:19 +04:00
/* struct btrfs_node */
BTRFS_SETGET_FUNCS ( key_blockptr , struct btrfs_key_ptr , blockptr , 64 ) ;
2007-12-11 17:25:06 +03:00
BTRFS_SETGET_FUNCS ( key_generation , struct btrfs_key_ptr , generation , 64 ) ;
2013-07-16 07:19:18 +04:00
BTRFS_SETGET_STACK_FUNCS ( stack_key_blockptr , struct btrfs_key_ptr ,
blockptr , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_key_generation , struct btrfs_key_ptr ,
generation , 64 ) ;
2007-03-22 19:13:20 +03:00
2007-10-16 00:14:19 +04:00
static inline u64 btrfs_node_blockptr ( struct extent_buffer * eb , int nr )
2007-03-13 16:49:06 +03:00
{
2007-10-16 00:14:19 +04:00
unsigned long ptr ;
ptr = offsetof ( struct btrfs_node , ptrs ) +
sizeof ( struct btrfs_key_ptr ) * nr ;
return btrfs_key_blockptr ( eb , ( struct btrfs_key_ptr * ) ptr ) ;
2007-03-13 16:49:06 +03:00
}
2007-10-16 00:14:19 +04:00
static inline void btrfs_set_node_blockptr ( struct extent_buffer * eb ,
int nr , u64 val )
2007-03-13 16:49:06 +03:00
{
2007-10-16 00:14:19 +04:00
unsigned long ptr ;
ptr = offsetof ( struct btrfs_node , ptrs ) +
sizeof ( struct btrfs_key_ptr ) * nr ;
btrfs_set_key_blockptr ( eb , ( struct btrfs_key_ptr * ) ptr , val ) ;
2007-03-13 16:49:06 +03:00
}
2007-12-11 17:25:06 +03:00
static inline u64 btrfs_node_ptr_generation ( struct extent_buffer * eb , int nr )
{
unsigned long ptr ;
ptr = offsetof ( struct btrfs_node , ptrs ) +
sizeof ( struct btrfs_key_ptr ) * nr ;
return btrfs_key_generation ( eb , ( struct btrfs_key_ptr * ) ptr ) ;
}
static inline void btrfs_set_node_ptr_generation ( struct extent_buffer * eb ,
int nr , u64 val )
{
unsigned long ptr ;
ptr = offsetof ( struct btrfs_node , ptrs ) +
sizeof ( struct btrfs_key_ptr ) * nr ;
btrfs_set_key_generation ( eb , ( struct btrfs_key_ptr * ) ptr , val ) ;
}
2007-10-16 00:18:55 +04:00
static inline unsigned long btrfs_node_key_ptr_offset ( int nr )
2007-04-21 04:23:12 +04:00
{
2007-10-16 00:14:19 +04:00
return offsetof ( struct btrfs_node , ptrs ) +
sizeof ( struct btrfs_key_ptr ) * nr ;
2007-04-21 04:23:12 +04:00
}
2017-06-29 06:56:53 +03:00
void btrfs_node_key ( const struct extent_buffer * eb ,
2007-11-06 23:09:29 +03:00
struct btrfs_disk_key * disk_key , int nr ) ;
2007-10-16 00:14:19 +04:00
static inline void btrfs_set_node_key ( struct extent_buffer * eb ,
struct btrfs_disk_key * disk_key , int nr )
2007-03-13 16:28:32 +03:00
{
2007-10-16 00:14:19 +04:00
unsigned long ptr ;
ptr = btrfs_node_key_ptr_offset ( nr ) ;
write_eb_member ( eb , ( struct btrfs_key_ptr * ) ptr ,
struct btrfs_key_ptr , key , disk_key ) ;
2007-03-13 16:28:32 +03:00
}
2007-10-16 00:14:19 +04:00
/* struct btrfs_item */
BTRFS_SETGET_FUNCS ( item_offset , struct btrfs_item , offset , 32 ) ;
BTRFS_SETGET_FUNCS ( item_size , struct btrfs_item , size , 32 ) ;
2013-07-16 07:19:18 +04:00
BTRFS_SETGET_STACK_FUNCS ( stack_item_offset , struct btrfs_item , offset , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_item_size , struct btrfs_item , size , 32 ) ;
2007-04-21 04:23:12 +04:00
2007-10-16 00:14:19 +04:00
static inline unsigned long btrfs_item_nr_offset ( int nr )
2007-03-13 16:28:32 +03:00
{
2007-10-16 00:14:19 +04:00
return offsetof ( struct btrfs_leaf , items ) +
sizeof ( struct btrfs_item ) * nr ;
2007-03-13 16:28:32 +03:00
}
2013-09-16 18:58:09 +04:00
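/*
 * The returned "pointer" is the byte offset of item nr within the leaf,
 * cast to a pointer type. It must only be handed to the extent buffer
 * accessors and never dereferenced directly.
 */
static inline struct btrfs_item * btrfs_item_nr ( int nr )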
2007-03-13 03:12:07 +03:00
{
2007-10-16 00:14:19 +04:00
return ( struct btrfs_item * ) btrfs_item_nr_offset ( nr ) ;
2007-03-13 03:12:07 +03:00
}
2017-06-29 06:56:53 +03:00
static inline u32 btrfs_item_end ( const struct extent_buffer * eb ,
2007-10-16 00:14:19 +04:00
struct btrfs_item * item )
2007-03-13 03:12:07 +03:00
{
2007-10-16 00:14:19 +04:00
return btrfs_item_offset ( eb , item ) + btrfs_item_size ( eb , item ) ;
2007-03-13 03:12:07 +03:00
}
2017-06-29 06:56:53 +03:00
static inline u32 btrfs_item_end_nr ( const struct extent_buffer * eb , int nr )
2007-03-13 03:12:07 +03:00
{
2013-09-16 18:58:09 +04:00
return btrfs_item_end ( eb , btrfs_item_nr ( nr ) ) ;
2007-03-13 03:12:07 +03:00
}
2017-06-29 06:56:53 +03:00
static inline u32 btrfs_item_offset_nr ( const struct extent_buffer * eb , int nr )
2007-03-13 03:12:07 +03:00
{
2013-09-16 18:58:09 +04:00
return btrfs_item_offset ( eb , btrfs_item_nr ( nr ) ) ;
2007-03-13 03:12:07 +03:00
}
2017-06-29 06:56:53 +03:00
static inline u32 btrfs_item_size_nr ( const struct extent_buffer * eb , int nr )
2007-03-13 03:12:07 +03:00
{
2013-09-16 18:58:09 +04:00
return btrfs_item_size ( eb , btrfs_item_nr ( nr ) ) ;
2007-03-13 03:12:07 +03:00
}
2017-06-29 06:56:53 +03:00
static inline void btrfs_item_key ( const struct extent_buffer * eb ,
2007-10-16 00:14:19 +04:00
struct btrfs_disk_key * disk_key , int nr )
2007-03-15 22:18:43 +03:00
{
2013-09-16 18:58:09 +04:00
struct btrfs_item * item = btrfs_item_nr ( nr ) ;
2007-10-16 00:14:19 +04:00
read_eb_member ( eb , item , struct btrfs_item , key , disk_key ) ;
2007-03-15 22:18:43 +03:00
}
2007-10-16 00:14:19 +04:00
static inline void btrfs_set_item_key ( struct extent_buffer * eb ,
struct btrfs_disk_key * disk_key , int nr )
2007-03-15 22:18:43 +03:00
{
2013-09-16 18:58:09 +04:00
struct btrfs_item * item = btrfs_item_nr ( nr ) ;
2007-10-16 00:14:19 +04:00
write_eb_member ( eb , item , struct btrfs_item , key , disk_key ) ;
2007-03-15 22:18:43 +03:00
}
2008-09-06 00:13:11 +04:00
BTRFS_SETGET_FUNCS ( dir_log_end , struct btrfs_dir_log_item , end , 64 ) ;
2008-11-18 04:37:39 +03:00
/*
* struct btrfs_root_ref
*/
BTRFS_SETGET_FUNCS ( root_ref_dirid , struct btrfs_root_ref , dirid , 64 ) ;
BTRFS_SETGET_FUNCS ( root_ref_sequence , struct btrfs_root_ref , sequence , 64 ) ;
BTRFS_SETGET_FUNCS ( root_ref_name_len , struct btrfs_root_ref , name_len , 16 ) ;
2007-10-16 00:14:19 +04:00
/* struct btrfs_dir_item */
2007-11-16 19:45:54 +03:00
BTRFS_SETGET_FUNCS ( dir_data_len , struct btrfs_dir_item , data_len , 16 ) ;
2007-10-16 00:14:19 +04:00
BTRFS_SETGET_FUNCS ( dir_type , struct btrfs_dir_item , type , 8 ) ;
BTRFS_SETGET_FUNCS ( dir_name_len , struct btrfs_dir_item , name_len , 16 ) ;
2008-09-06 00:13:11 +04:00
BTRFS_SETGET_FUNCS ( dir_transid , struct btrfs_dir_item , transid , 64 ) ;
2013-07-16 07:19:18 +04:00
BTRFS_SETGET_STACK_FUNCS ( stack_dir_type , struct btrfs_dir_item , type , 8 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dir_data_len , struct btrfs_dir_item ,
data_len , 16 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dir_name_len , struct btrfs_dir_item ,
name_len , 16 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dir_transid , struct btrfs_dir_item ,
transid , 64 ) ;
2007-03-15 22:18:43 +03:00
2017-06-29 06:56:53 +03:00
static inline void btrfs_dir_item_key ( const struct extent_buffer * eb ,
const struct btrfs_dir_item * item ,
2007-10-16 00:14:19 +04:00
struct btrfs_disk_key * key )
2007-03-15 22:18:43 +03:00
{
2007-10-16 00:14:19 +04:00
read_eb_member ( eb , item , struct btrfs_dir_item , location , key ) ;
2007-03-15 22:18:43 +03:00
}
2007-10-16 00:14:19 +04:00
static inline void btrfs_set_dir_item_key ( struct extent_buffer * eb ,
struct btrfs_dir_item * item ,
2017-06-29 06:56:53 +03:00
const struct btrfs_disk_key * key )
2007-03-16 15:46:49 +03:00
{
2007-10-16 00:14:19 +04:00
write_eb_member ( eb , item , struct btrfs_dir_item , location , key ) ;
2007-03-16 15:46:49 +03:00
}
2010-06-21 22:48:16 +04:00
BTRFS_SETGET_FUNCS ( free_space_entries , struct btrfs_free_space_header ,
num_entries , 64 ) ;
BTRFS_SETGET_FUNCS ( free_space_bitmaps , struct btrfs_free_space_header ,
num_bitmaps , 64 ) ;
BTRFS_SETGET_FUNCS ( free_space_generation , struct btrfs_free_space_header ,
generation , 64 ) ;
2017-06-29 06:56:53 +03:00
static inline void btrfs_free_space_key ( const struct extent_buffer * eb ,
const struct btrfs_free_space_header * h ,
2010-06-21 22:48:16 +04:00
struct btrfs_disk_key * key )
{
read_eb_member ( eb , h , struct btrfs_free_space_header , location , key ) ;
}
static inline void btrfs_set_free_space_key ( struct extent_buffer * eb ,
struct btrfs_free_space_header * h ,
2017-06-29 06:56:53 +03:00
const struct btrfs_disk_key * key )
2010-06-21 22:48:16 +04:00
{
write_eb_member ( eb , h , struct btrfs_free_space_header , location , key ) ;
}
2007-10-16 00:14:19 +04:00
/* struct btrfs_disk_key */
BTRFS_SETGET_STACK_FUNCS ( disk_key_objectid , struct btrfs_disk_key ,
objectid , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( disk_key_offset , struct btrfs_disk_key , offset , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( disk_key_type , struct btrfs_disk_key , type , 8 ) ;
2007-03-15 22:18:43 +03:00
2007-03-12 23:22:34 +03:00
static inline void btrfs_disk_key_to_cpu ( struct btrfs_key * cpu ,
2017-01-18 10:24:37 +03:00
const struct btrfs_disk_key * disk )
2007-03-12 23:22:34 +03:00
{
cpu - > offset = le64_to_cpu ( disk - > offset ) ;
2007-10-16 00:14:19 +04:00
cpu - > type = disk - > type ;
2007-03-12 23:22:34 +03:00
cpu - > objectid = le64_to_cpu ( disk - > objectid ) ;
}
static inline void btrfs_cpu_key_to_disk ( struct btrfs_disk_key * disk ,
2017-01-18 10:24:37 +03:00
const struct btrfs_key * cpu )
2007-03-12 23:22:34 +03:00
{
disk - > offset = cpu_to_le64 ( cpu - > offset ) ;
2007-10-16 00:14:19 +04:00
disk - > type = cpu - > type ;
2007-03-12 23:22:34 +03:00
disk - > objectid = cpu_to_le64 ( cpu - > objectid ) ;
}
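The two key converters above can be tried outside the kernel. The following is a minimal userspace sketch, not the kernel code: the `demo_` names, the byte-wise little-endian helpers, and the use of the GCC/Clang `packed` attribute are all assumptions made for illustration. It shows that the on-disk key always stores its 64-bit fields least significant byte first, whatever the host byte order.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's cpu_to_le64()/le64_to_cpu(). */
static inline uint64_t demo_cpu_to_le64(uint64_t v)
{
	uint8_t b[8];
	uint64_t out;
	int i;

	for (i = 0; i < 8; i++)
		b[i] = (uint8_t)(v >> (8 * i));	/* least significant byte first */
	memcpy(&out, b, sizeof(out));
	return out;
}

static inline uint64_t demo_le64_to_cpu(uint64_t v)
{
	uint8_t b[8];
	uint64_t out = 0;
	int i;

	memcpy(b, &v, sizeof(b));
	for (i = 0; i < 8; i++)
		out |= (uint64_t)b[i] << (8 * i);
	return out;
}

/* Assumed on-disk layout: packed, multi-byte fields little-endian. */
struct demo_disk_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
} __attribute__((packed));

/* In-memory layout: same fields in native byte order. */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

static inline void demo_disk_key_to_cpu(struct demo_key *cpu,
					const struct demo_disk_key *disk)
{
	cpu->offset = demo_le64_to_cpu(disk->offset);
	cpu->type = disk->type;		/* single byte, no swap needed */
	cpu->objectid = demo_le64_to_cpu(disk->objectid);
}

static inline void demo_cpu_key_to_disk(struct demo_disk_key *disk,
					const struct demo_key *cpu)
{
	disk->offset = demo_cpu_to_le64(cpu->offset);
	disk->type = cpu->type;
	disk->objectid = demo_cpu_to_le64(cpu->objectid);
}
```

Converting a key to disk form and back should yield the original key, and the first byte of the disk form should be the low byte of `objectid` on any host.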
2017-06-29 06:56:53 +03:00
static inline void btrfs_node_key_to_cpu ( const struct extent_buffer * eb ,
struct btrfs_key * key , int nr )
2007-03-23 22:56:19 +03:00
{
2007-10-16 00:14:19 +04:00
struct btrfs_disk_key disk_key ;
btrfs_node_key ( eb , & disk_key , nr ) ;
btrfs_disk_key_to_cpu ( key , & disk_key ) ;
2007-03-23 22:56:19 +03:00
}
2017-06-29 06:56:53 +03:00
static inline void btrfs_item_key_to_cpu ( const struct extent_buffer * eb ,
struct btrfs_key * key , int nr )
2007-03-23 22:56:19 +03:00
{
2007-10-16 00:14:19 +04:00
struct btrfs_disk_key disk_key ;
btrfs_item_key ( eb , & disk_key , nr ) ;
btrfs_disk_key_to_cpu ( key , & disk_key ) ;
2007-03-23 22:56:19 +03:00
}
2017-06-29 06:56:53 +03:00
static inline void btrfs_dir_item_key_to_cpu ( const struct extent_buffer * eb ,
const struct btrfs_dir_item * item ,
struct btrfs_key * key )
2007-04-21 04:23:12 +04:00
{
2007-10-16 00:14:19 +04:00
struct btrfs_disk_key disk_key ;
btrfs_dir_item_key ( eb , item , & disk_key ) ;
btrfs_disk_key_to_cpu ( key , & disk_key ) ;
2007-04-21 04:23:12 +04:00
}
2017-01-18 10:24:37 +03:00
static inline u8 btrfs_key_type ( const struct btrfs_key * key )
2007-03-13 23:47:54 +03:00
{
2007-10-16 00:14:19 +04:00
return key - > type ;
2007-03-13 23:47:54 +03:00
}
2007-10-16 00:14:19 +04:00
static inline void btrfs_set_key_type ( struct btrfs_key * key , u8 val )
2007-03-13 23:47:54 +03:00
{
2007-10-16 00:14:19 +04:00
key - > type = val ;
2007-03-13 23:47:54 +03:00
}
2007-10-16 00:14:19 +04:00
/* struct btrfs_header */
2007-10-16 00:15:53 +04:00
BTRFS_SETGET_HEADER_FUNCS ( header_bytenr , struct btrfs_header , bytenr , 64 ) ;
2007-10-16 00:14:19 +04:00
BTRFS_SETGET_HEADER_FUNCS ( header_generation , struct btrfs_header ,
generation , 64 ) ;
BTRFS_SETGET_HEADER_FUNCS ( header_owner , struct btrfs_header , owner , 64 ) ;
BTRFS_SETGET_HEADER_FUNCS ( header_nritems , struct btrfs_header , nritems , 32 ) ;
2008-04-01 19:21:32 +04:00
BTRFS_SETGET_HEADER_FUNCS ( header_flags , struct btrfs_header , flags , 64 ) ;
2007-10-16 00:14:19 +04:00
BTRFS_SETGET_HEADER_FUNCS ( header_level , struct btrfs_header , level , 8 ) ;
2013-07-16 07:19:18 +04:00
BTRFS_SETGET_STACK_FUNCS ( stack_header_generation , struct btrfs_header ,
generation , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_header_owner , struct btrfs_header , owner , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_header_nritems , struct btrfs_header ,
nritems , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_header_bytenr , struct btrfs_header , bytenr , 64 ) ;
2007-04-09 18:42:37 +04:00
2017-06-29 06:56:53 +03:00
static inline int btrfs_header_flag ( const struct extent_buffer * eb , u64 flag )
2008-04-01 19:21:32 +04:00
{
return ( btrfs_header_flags ( eb ) & flag ) = = flag ;
}
static inline int btrfs_set_header_flag ( struct extent_buffer * eb , u64 flag )
{
u64 flags = btrfs_header_flags ( eb ) ;
btrfs_set_header_flags ( eb , flags | flag ) ;
return ( flags & flag ) = = flag ;
}
static inline int btrfs_clear_header_flag ( struct extent_buffer * eb , u64 flag )
{
u64 flags = btrfs_header_flags ( eb ) ;
btrfs_set_header_flags ( eb , flags & ~ flag ) ;
return ( flags & flag ) = = flag ;
}
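The set and clear helpers above return whether the flag was set before the call, which lets a caller detect a state transition in one step. A minimal sketch of the same pattern on a plain struct follows; `struct demo_header` and the `demo_` functions are hypothetical and stand in for the extent buffer accessors.

```c
#include <stdint.h>

/* Hypothetical in-memory header holding only the flags word. */
struct demo_header {
	uint64_t flags;
};

/* Returns 1 only if every bit in flag is currently set. */
static inline int demo_header_flag(const struct demo_header *h, uint64_t flag)
{
	return (h->flags & flag) == flag;
}

/* Sets the flag and reports whether it was already set. */
static inline int demo_set_header_flag(struct demo_header *h, uint64_t flag)
{
	uint64_t old = h->flags;

	h->flags = old | flag;
	return (old & flag) == flag;
}

/* Clears the flag and reports whether it was set before the call. */
static inline int demo_clear_header_flag(struct demo_header *h, uint64_t flag)
{
	uint64_t old = h->flags;

	h->flags = old & ~flag;
	return (old & flag) == flag;
}
```

A caller can write `if (!demo_set_header_flag(&h, flag))` to run one-time work only on the first transition.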
2017-06-29 06:56:53 +03:00
static inline int btrfs_header_backref_rev ( const struct extent_buffer * eb )
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in a subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 18:45:14 +04:00
{
u64 flags = btrfs_header_flags ( eb ) ;
return flags > > BTRFS_BACKREF_REV_SHIFT ;
}
static inline void btrfs_set_header_backref_rev ( struct extent_buffer * eb ,
int rev )
{
u64 flags = btrfs_header_flags ( eb ) ;
flags & = ~ BTRFS_BACKREF_REV_MASK ;
flags | = ( u64 ) rev < < BTRFS_BACKREF_REV_SHIFT ;
btrfs_set_header_flags ( eb , flags ) ;
}
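The back reference revision shares the 64-bit header flags word with the ordinary flag bits: it lives in the top bits and is read with a shift and written with a mask-then-or. The sketch below works on a bare `uint64_t`. The `DEMO_` constants (shift of 56, 256 possible revisions) are assumptions for this example and are not taken from this section of the header.

```c
#include <stdint.h>

/* Assumed values for illustration; the real constants are defined elsewhere. */
#define DEMO_BACKREF_REV_MAX	256
#define DEMO_BACKREF_REV_SHIFT	56
#define DEMO_BACKREF_REV_MASK \
	(((uint64_t)DEMO_BACKREF_REV_MAX - 1) << DEMO_BACKREF_REV_SHIFT)

/* Extract the revision stored in the top bits of the flags word. */
static inline int demo_backref_rev(uint64_t flags)
{
	return (int)(flags >> DEMO_BACKREF_REV_SHIFT);
}

/* Replace the revision while leaving the low flag bits untouched. */
static inline uint64_t demo_set_backref_rev(uint64_t flags, int rev)
{
	flags &= ~DEMO_BACKREF_REV_MASK;
	flags |= (uint64_t)rev << DEMO_BACKREF_REV_SHIFT;
	return flags;
}
```

Clearing the mask before or-ing in the new value is what allows a later call to lower the revision as well as raise it.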
2013-09-24 13:12:38 +04:00
static inline unsigned long btrfs_header_fsid ( void )
2007-04-09 18:42:37 +04:00
{
2013-08-20 15:20:14 +04:00
return offsetof ( struct btrfs_header , fsid ) ;
2007-04-09 18:42:37 +04:00
}
2017-06-29 06:56:53 +03:00
static inline unsigned long btrfs_header_chunk_tree_uuid ( const struct extent_buffer * eb )
2008-04-15 23:41:47 +04:00
{
2013-08-20 15:20:15 +04:00
return offsetof ( struct btrfs_header , chunk_tree_uuid ) ;
2008-04-15 23:41:47 +04:00
}
2017-06-29 06:56:53 +03:00
static inline int btrfs_is_leaf ( const struct extent_buffer * eb )
2007-03-13 23:47:54 +03:00
{
2009-01-06 05:25:51 +03:00
return btrfs_header_level ( eb ) = = 0 ;
2007-03-13 23:47:54 +03:00
}
2007-10-16 00:14:19 +04:00
/* struct btrfs_root_item */
2008-10-29 21:49:05 +03:00
BTRFS_SETGET_FUNCS ( disk_root_generation , struct btrfs_root_item ,
generation , 64 ) ;
2007-10-16 00:14:19 +04:00
BTRFS_SETGET_FUNCS ( disk_root_refs , struct btrfs_root_item , refs , 32 ) ;
2007-10-16 00:15:53 +04:00
BTRFS_SETGET_FUNCS ( disk_root_bytenr , struct btrfs_root_item , bytenr , 64 ) ;
BTRFS_SETGET_FUNCS ( disk_root_level , struct btrfs_root_item , level , 8 ) ;
2007-03-13 23:47:54 +03:00
2008-10-29 21:49:05 +03:00
BTRFS_SETGET_STACK_FUNCS ( root_generation , struct btrfs_root_item ,
generation , 64 ) ;
2007-10-16 00:15:53 +04:00
BTRFS_SETGET_STACK_FUNCS ( root_bytenr , struct btrfs_root_item , bytenr , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( root_level , struct btrfs_root_item , level , 8 ) ;
2007-10-16 00:14:19 +04:00
BTRFS_SETGET_STACK_FUNCS ( root_dirid , struct btrfs_root_item , root_dirid , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( root_refs , struct btrfs_root_item , refs , 32 ) ;
2008-12-02 14:36:08 +03:00
BTRFS_SETGET_STACK_FUNCS ( root_flags , struct btrfs_root_item , flags , 64 ) ;
2007-10-16 00:15:53 +04:00
BTRFS_SETGET_STACK_FUNCS ( root_used , struct btrfs_root_item , bytes_used , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( root_limit , struct btrfs_root_item , byte_limit , 64 ) ;
2008-10-30 21:20:02 +03:00
BTRFS_SETGET_STACK_FUNCS ( root_last_snapshot , struct btrfs_root_item ,
last_snapshot , 64 ) ;
2012-07-25 19:35:53 +04:00
BTRFS_SETGET_STACK_FUNCS ( root_generation_v2 , struct btrfs_root_item ,
generation_v2 , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( root_ctransid , struct btrfs_root_item ,
ctransid , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( root_otransid , struct btrfs_root_item ,
otransid , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( root_stransid , struct btrfs_root_item ,
stransid , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( root_rtransid , struct btrfs_root_item ,
rtransid , 64 ) ;
2007-03-14 21:14:43 +03:00
2017-06-29 06:56:53 +03:00
static inline bool btrfs_root_readonly ( const struct btrfs_root * root )
2010-12-20 11:04:08 +03:00
{
2012-04-13 19:49:04 +04:00
return ( root - > root_item . flags & cpu_to_le64 ( BTRFS_ROOT_SUBVOL_RDONLY ) ) ! = 0 ;
2010-12-20 11:04:08 +03:00
}
2017-06-29 06:56:53 +03:00
static inline bool btrfs_root_dead ( const struct btrfs_root * root )
2014-04-15 18:41:44 +04:00
{
return ( root - > root_item . flags & cpu_to_le64 ( BTRFS_ROOT_SUBVOL_DEAD ) ) ! = 0 ;
}
2011-11-03 23:17:42 +04:00
/* struct btrfs_root_backup */
BTRFS_SETGET_STACK_FUNCS ( backup_tree_root , struct btrfs_root_backup ,
tree_root , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_tree_root_gen , struct btrfs_root_backup ,
tree_root_gen , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_tree_root_level , struct btrfs_root_backup ,
tree_root_level , 8 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_chunk_root , struct btrfs_root_backup ,
chunk_root , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_chunk_root_gen , struct btrfs_root_backup ,
chunk_root_gen , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_chunk_root_level , struct btrfs_root_backup ,
chunk_root_level , 8 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_extent_root , struct btrfs_root_backup ,
extent_root , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_extent_root_gen , struct btrfs_root_backup ,
extent_root_gen , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_extent_root_level , struct btrfs_root_backup ,
extent_root_level , 8 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_fs_root , struct btrfs_root_backup ,
fs_root , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_fs_root_gen , struct btrfs_root_backup ,
fs_root_gen , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_fs_root_level , struct btrfs_root_backup ,
fs_root_level , 8 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_dev_root , struct btrfs_root_backup ,
dev_root , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_dev_root_gen , struct btrfs_root_backup ,
dev_root_gen , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_dev_root_level , struct btrfs_root_backup ,
dev_root_level , 8 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_csum_root , struct btrfs_root_backup ,
csum_root , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_csum_root_gen , struct btrfs_root_backup ,
csum_root_gen , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_csum_root_level , struct btrfs_root_backup ,
csum_root_level , 8 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_total_bytes , struct btrfs_root_backup ,
total_bytes , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_bytes_used , struct btrfs_root_backup ,
bytes_used , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_num_devices , struct btrfs_root_backup ,
num_devices , 64 ) ;
2012-01-17 00:04:48 +04:00
/* struct btrfs_balance_item */
BTRFS_SETGET_FUNCS ( balance_flags , struct btrfs_balance_item , flags , 64 ) ;
2008-12-02 15:17:45 +03:00
2017-06-29 06:56:53 +03:00
static inline void btrfs_balance_data ( const struct extent_buffer * eb ,
const struct btrfs_balance_item * bi ,
2012-01-17 00:04:48 +04:00
struct btrfs_disk_balance_args * ba )
{
read_eb_member ( eb , bi , struct btrfs_balance_item , data , ba ) ;
}
static inline void btrfs_set_balance_data ( struct extent_buffer * eb ,
2017-06-29 06:56:53 +03:00
struct btrfs_balance_item * bi ,
const struct btrfs_disk_balance_args * ba )
2012-01-17 00:04:48 +04:00
{
write_eb_member ( eb , bi , struct btrfs_balance_item , data , ba ) ;
}
2017-06-29 06:56:53 +03:00
static inline void btrfs_balance_meta ( const struct extent_buffer * eb ,
const struct btrfs_balance_item * bi ,
2012-01-17 00:04:48 +04:00
struct btrfs_disk_balance_args * ba )
{
read_eb_member ( eb , bi , struct btrfs_balance_item , meta , ba ) ;
}
static inline void btrfs_set_balance_meta ( struct extent_buffer * eb ,
2017-06-29 06:56:53 +03:00
struct btrfs_balance_item * bi ,
const struct btrfs_disk_balance_args * ba )
2012-01-17 00:04:48 +04:00
{
write_eb_member ( eb , bi , struct btrfs_balance_item , meta , ba ) ;
}
2017-06-29 06:56:53 +03:00
static inline void btrfs_balance_sys ( const struct extent_buffer * eb ,
const struct btrfs_balance_item * bi ,
2012-01-17 00:04:48 +04:00
struct btrfs_disk_balance_args * ba )
{
read_eb_member ( eb , bi , struct btrfs_balance_item , sys , ba ) ;
}
static inline void btrfs_set_balance_sys ( struct extent_buffer * eb ,
2017-06-29 06:56:53 +03:00
struct btrfs_balance_item * bi ,
const struct btrfs_disk_balance_args * ba )
2012-01-17 00:04:48 +04:00
{
write_eb_member ( eb , bi , struct btrfs_balance_item , sys , ba ) ;
}
static inline void
btrfs_disk_balance_args_to_cpu ( struct btrfs_balance_args * cpu ,
2017-06-29 06:56:53 +03:00
const struct btrfs_disk_balance_args * disk )
2012-01-17 00:04:48 +04:00
{
memset ( cpu , 0 , sizeof ( * cpu ) ) ;
cpu - > profiles = le64_to_cpu ( disk - > profiles ) ;
cpu - > usage = le64_to_cpu ( disk - > usage ) ;
cpu - > devid = le64_to_cpu ( disk - > devid ) ;
cpu - > pstart = le64_to_cpu ( disk - > pstart ) ;
cpu - > pend = le64_to_cpu ( disk - > pend ) ;
cpu - > vstart = le64_to_cpu ( disk - > vstart ) ;
cpu - > vend = le64_to_cpu ( disk - > vend ) ;
cpu - > target = le64_to_cpu ( disk - > target ) ;
cpu - > flags = le64_to_cpu ( disk - > flags ) ;
2014-05-07 19:37:51 +04:00
cpu - > limit = le64_to_cpu ( disk - > limit ) ;
2016-11-01 16:21:23 +03:00
cpu - > stripes_min = le32_to_cpu ( disk - > stripes_min ) ;
cpu - > stripes_max = le32_to_cpu ( disk - > stripes_max ) ;
2012-01-17 00:04:48 +04:00
}
static inline void
btrfs_cpu_balance_args_to_disk ( struct btrfs_disk_balance_args * disk ,
2017-06-29 06:56:53 +03:00
const struct btrfs_balance_args * cpu )
2012-01-17 00:04:48 +04:00
{
memset ( disk , 0 , sizeof ( * disk ) ) ;
disk - > profiles = cpu_to_le64 ( cpu - > profiles ) ;
disk - > usage = cpu_to_le64 ( cpu - > usage ) ;
disk - > devid = cpu_to_le64 ( cpu - > devid ) ;
disk - > pstart = cpu_to_le64 ( cpu - > pstart ) ;
disk - > pend = cpu_to_le64 ( cpu - > pend ) ;
disk - > vstart = cpu_to_le64 ( cpu - > vstart ) ;
disk - > vend = cpu_to_le64 ( cpu - > vend ) ;
disk - > target = cpu_to_le64 ( cpu - > target ) ;
disk - > flags = cpu_to_le64 ( cpu - > flags ) ;
2014-05-07 19:37:51 +04:00
disk - > limit = cpu_to_le64 ( cpu - > limit ) ;
2016-11-01 16:21:23 +03:00
disk - > stripes_min = cpu_to_le32 ( cpu - > stripes_min ) ;
disk - > stripes_max = cpu_to_le32 ( cpu - > stripes_max ) ;
2012-01-17 00:04:48 +04:00
}
/* struct btrfs_super_block */
2007-10-16 00:15:53 +04:00
BTRFS_SETGET_STACK_FUNCS ( super_bytenr , struct btrfs_super_block , bytenr , 64 ) ;
2008-05-07 19:43:44 +04:00
BTRFS_SETGET_STACK_FUNCS ( super_flags , struct btrfs_super_block , flags , 64 ) ;
2007-10-16 00:14:19 +04:00
BTRFS_SETGET_STACK_FUNCS ( super_generation , struct btrfs_super_block ,
generation , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( super_root , struct btrfs_super_block , root , 64 ) ;
2008-03-24 22:01:56 +03:00
BTRFS_SETGET_STACK_FUNCS ( super_sys_array_size ,
struct btrfs_super_block , sys_chunk_array_size , 32 ) ;
2008-10-29 21:49:05 +03:00
BTRFS_SETGET_STACK_FUNCS ( super_chunk_root_generation ,
struct btrfs_super_block , chunk_root_generation , 64 ) ;
2007-10-16 00:15:53 +04:00
BTRFS_SETGET_STACK_FUNCS ( super_root_level , struct btrfs_super_block ,
root_level , 8 ) ;
2008-03-24 22:01:56 +03:00
BTRFS_SETGET_STACK_FUNCS ( super_chunk_root , struct btrfs_super_block ,
chunk_root , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( super_chunk_root_level , struct btrfs_super_block ,
2008-09-06 00:13:11 +04:00
chunk_root_level , 8 ) ;
BTRFS_SETGET_STACK_FUNCS ( super_log_root , struct btrfs_super_block ,
log_root , 64 ) ;
2008-12-09 00:40:21 +03:00
BTRFS_SETGET_STACK_FUNCS ( super_log_root_transid , struct btrfs_super_block ,
log_root_transid , 64 ) ;
2008-09-06 00:13:11 +04:00
BTRFS_SETGET_STACK_FUNCS ( super_log_root_level , struct btrfs_super_block ,
log_root_level , 8 ) ;
2007-10-16 00:15:53 +04:00
BTRFS_SETGET_STACK_FUNCS ( super_total_bytes , struct btrfs_super_block ,
total_bytes , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( super_bytes_used , struct btrfs_super_block ,
bytes_used , 64 ) ;
2007-10-16 00:14:19 +04:00
BTRFS_SETGET_STACK_FUNCS ( super_sectorsize , struct btrfs_super_block ,
sectorsize , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( super_nodesize , struct btrfs_super_block ,
nodesize , 32 ) ;
2007-11-30 19:30:34 +03:00
BTRFS_SETGET_STACK_FUNCS ( super_stripesize , struct btrfs_super_block ,
stripesize , 32 ) ;
2007-10-16 00:14:19 +04:00
BTRFS_SETGET_STACK_FUNCS ( super_root_dir , struct btrfs_super_block ,
root_dir_objectid , 64 ) ;
2008-03-24 22:02:07 +03:00
BTRFS_SETGET_STACK_FUNCS ( super_num_devices , struct btrfs_super_block ,
num_devices , 64 ) ;
2008-12-02 14:36:08 +03:00
BTRFS_SETGET_STACK_FUNCS ( super_compat_flags , struct btrfs_super_block ,
compat_flags , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( super_compat_ro_flags , struct btrfs_super_block ,
2009-12-18 00:32:27 +03:00
compat_ro_flags , 64 ) ;
2008-12-02 14:36:08 +03:00
BTRFS_SETGET_STACK_FUNCS ( super_incompat_flags , struct btrfs_super_block ,
incompat_flags , 64 ) ;
2008-12-02 15:17:45 +03:00
BTRFS_SETGET_STACK_FUNCS ( super_csum_type , struct btrfs_super_block ,
csum_type , 16 ) ;
2010-06-21 22:48:16 +04:00
BTRFS_SETGET_STACK_FUNCS ( super_cache_generation , struct btrfs_super_block ,
cache_generation , 64 ) ;
2013-07-16 07:19:18 +04:00
BTRFS_SETGET_STACK_FUNCS ( super_magic , struct btrfs_super_block , magic , 64 ) ;
2013-08-15 19:11:22 +04:00
BTRFS_SETGET_STACK_FUNCS ( super_uuid_tree_generation , struct btrfs_super_block ,
uuid_tree_generation , 64 ) ;
2008-12-02 15:17:45 +03:00
2017-06-29 06:56:53 +03:00
static inline int btrfs_super_csum_size ( const struct btrfs_super_block * s )
2008-12-02 15:17:45 +03:00
{
2013-03-06 18:57:46 +04:00
u16 t = btrfs_super_csum_type ( s ) ;
/*
* csum type is validated at mount time
*/
2008-12-02 15:17:45 +03:00
return btrfs_csum_sizes [ t ] ;
}
2007-03-21 18:12:56 +03:00
2016-09-23 23:44:44 +03:00
/*
* Leaf data grows from the end of the node toward the front.
* This returns the offset of the start of the last item's data,
* which is where the leaf data area currently ends.
*/
2017-06-29 06:56:53 +03:00
static inline unsigned int leaf_data_end ( const struct btrfs_fs_info * fs_info ,
const struct extent_buffer * leaf )
2016-09-23 23:44:44 +03:00
{
u32 nr = btrfs_header_nritems ( leaf ) ;
if ( nr = = 0 )
2016-06-23 01:54:23 +03:00
return BTRFS_LEAF_DATA_SIZE ( fs_info ) ;
2016-09-23 23:44:44 +03:00
return btrfs_item_offset_nr ( leaf , nr - 1 ) ;
}
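The end-to-front growth of leaf data can be modeled with a small array of item offsets. The sketch below is a simplified model: `struct demo_leaf`, `DEMO_LEAF_DATA_SIZE`, and `demo_leaf_add_item()` are hypothetical, and the model ignores the space that item headers take at the front of a real leaf.

```c
#include <stdint.h>

#define DEMO_LEAF_DATA_SIZE	3995u	/* hypothetical size of the data area */
#define DEMO_MAX_ITEMS		8

/* Simplified leaf: only the item count and each item's data offset. */
struct demo_leaf {
	uint32_t nritems;
	uint32_t item_offset[DEMO_MAX_ITEMS];
};

/* Offset where the data of the last item starts; the full size if empty. */
static inline uint32_t demo_leaf_data_end(const struct demo_leaf *leaf)
{
	uint32_t nr = leaf->nritems;

	if (nr == 0)
		return DEMO_LEAF_DATA_SIZE;
	return leaf->item_offset[nr - 1];
}

/* Place a new item's data directly in front of the current data end. */
static inline int demo_leaf_add_item(struct demo_leaf *leaf, uint32_t size)
{
	uint32_t end = demo_leaf_data_end(leaf);

	if (leaf->nritems >= DEMO_MAX_ITEMS || size > end)
		return -1;	/* no slot left or not enough room */
	leaf->item_offset[leaf->nritems] = end - size;
	leaf->nritems++;
	return 0;
}
```

Each added item lowers the value returned by `demo_leaf_data_end()` by the item's size, so the free space is always the gap below that offset.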
2007-10-16 00:14:19 +04:00
/* struct btrfs_file_extent_item */
BTRFS_SETGET_FUNCS ( file_extent_type , struct btrfs_file_extent_item , type , 8 ) ;
2013-07-16 07:19:18 +04:00
BTRFS_SETGET_STACK_FUNCS ( stack_file_extent_disk_bytenr ,
struct btrfs_file_extent_item , disk_bytenr , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_file_extent_offset ,
struct btrfs_file_extent_item , offset , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_file_extent_generation ,
struct btrfs_file_extent_item , generation , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_file_extent_num_bytes ,
struct btrfs_file_extent_item , num_bytes , 64 ) ;
2013-11-14 06:11:49 +04:00
BTRFS_SETGET_STACK_FUNCS ( stack_file_extent_disk_num_bytes ,
struct btrfs_file_extent_item , disk_num_bytes , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_file_extent_compression ,
struct btrfs_file_extent_item , compression , 8 ) ;
2007-03-20 21:38:32 +03:00
2009-01-06 05:25:51 +03:00
static inline unsigned long
2017-06-29 06:56:53 +03:00
btrfs_file_extent_inline_start ( const struct btrfs_file_extent_item * e )
2007-04-19 21:37:44 +04:00
{
2014-07-24 19:34:58 +04:00
return ( unsigned long ) e + BTRFS_FILE_EXTENT_INLINE_DATA_START ;
2007-04-19 21:37:44 +04:00
}
static inline u32 btrfs_file_extent_calc_inline_size ( u32 datasize )
{
2014-07-24 19:34:58 +04:00
return BTRFS_FILE_EXTENT_INLINE_DATA_START + datasize ;
2007-03-20 21:38:32 +03:00
}
2007-10-16 00:15:53 +04:00
BTRFS_SETGET_FUNCS ( file_extent_disk_bytenr , struct btrfs_file_extent_item ,
disk_bytenr , 64 ) ;
2007-10-16 00:14:19 +04:00
BTRFS_SETGET_FUNCS ( file_extent_generation , struct btrfs_file_extent_item ,
generation , 64 ) ;
2007-10-16 00:15:53 +04:00
BTRFS_SETGET_FUNCS ( file_extent_disk_num_bytes , struct btrfs_file_extent_item ,
disk_num_bytes , 64 ) ;
2007-10-16 00:14:19 +04:00
BTRFS_SETGET_FUNCS ( file_extent_offset , struct btrfs_file_extent_item ,
offset , 64 ) ;
2007-10-16 00:15:53 +04:00
BTRFS_SETGET_FUNCS ( file_extent_num_bytes , struct btrfs_file_extent_item ,
num_bytes , 64 ) ;
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 21:49:59 +03:00
BTRFS_SETGET_FUNCS ( file_extent_ram_bytes , struct btrfs_file_extent_item ,
ram_bytes , 64 ) ;
BTRFS_SETGET_FUNCS ( file_extent_compression , struct btrfs_file_extent_item ,
compression , 8 ) ;
BTRFS_SETGET_FUNCS ( file_extent_encryption , struct btrfs_file_extent_item ,
encryption , 8 ) ;
BTRFS_SETGET_FUNCS ( file_extent_other_encoding , struct btrfs_file_extent_item ,
other_encoding , 16 ) ;
/*
* This returns the number of bytes used by the item on disk, minus the
* size of any extent headers. If a file is compressed on disk, this is
* the compressed size.
*/
2017-06-29 06:56:53 +03:00
static inline u32 btrfs_file_extent_inline_item_len (
const struct extent_buffer * eb ,
struct btrfs_item * e )
2008-10-29 21:49:59 +03:00
{
2014-07-24 19:34:58 +04:00
return btrfs_item_size ( eb , e ) - BTRFS_FILE_EXTENT_INLINE_DATA_START ;
2008-10-29 21:49:59 +03:00
}
2007-03-20 21:38:32 +03:00
2014-01-04 09:07:00 +04:00
/*
* This returns the number of file bytes represented by the inline item.
* If an item is compressed, this is the uncompressed size.
*/
2017-06-29 06:56:53 +03:00
static inline u32 btrfs_file_extent_inline_len ( const struct extent_buffer * eb ,
int slot ,
const struct btrfs_file_extent_item * fi )
2014-01-04 09:07:00 +04:00
{
struct btrfs_map_token token ;
btrfs_init_map_token ( & token ) ;
/*
* Return the space used on disk if this item isn't
* compressed or encoded.
*/
if ( btrfs_token_file_extent_compression ( eb , fi , & token ) == 0 &&
btrfs_token_file_extent_encryption ( eb , fi , & token ) == 0 &&
btrfs_token_file_extent_other_encoding ( eb , fi , & token ) == 0 ) {
return btrfs_file_extent_inline_item_len ( eb ,
btrfs_item_nr ( slot ) ) ;
}
/* otherwise use the ram bytes field */
return btrfs_token_file_extent_ram_bytes ( eb , fi , & token ) ;
}
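The decision made by btrfs_file_extent_inline_len() can be restated without the extent-buffer accessors. This is a minimal sketch over a plain in-memory struct; `struct inline_extent_sketch` and `inline_len_sketch()` are hypothetical names used only for illustration, and the struct is not the on-disk layout.

```c
#include <stdint.h>

/* Hypothetical in-memory stand-in for the fields the helper consults. */
struct inline_extent_sketch {
	uint8_t  compression;     /* 0 means not compressed */
	uint8_t  encryption;      /* 0 means not encrypted */
	uint16_t other_encoding;  /* 0 means no other encoding */
	uint64_t ram_bytes;       /* uncompressed length of the file data */
	uint32_t item_data_len;   /* bytes of inline data stored in the leaf */
};

static uint32_t inline_len_sketch(const struct inline_extent_sketch *fi)
{
	/* Unencoded data: file bytes equal the bytes stored in the item. */
	if (fi->compression == 0 && fi->encryption == 0 &&
	    fi->other_encoding == 0)
		return fi->item_data_len;

	/* Encoded data: the stored bytes differ, so use ram_bytes. */
	return (uint32_t)fi->ram_bytes;
}
```

For an item holding 100 stored bytes that decompress to 4096, the sketch returns 4096 when `compression` is nonzero and 100 when all three encoding fields are zero.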
2012-05-25 18:06:10 +04:00
/* btrfs_dev_stats_item */
2017-06-29 06:56:53 +03:00
static inline u64 btrfs_dev_stats_value ( const struct extent_buffer * eb ,
const struct btrfs_dev_stats_item * ptr ,
2012-05-25 18:06:10 +04:00
int index )
{
u64 val ;
read_extent_buffer ( eb , & val ,
offsetof ( struct btrfs_dev_stats_item , values ) +
( ( unsigned long ) ptr ) + ( index * sizeof ( u64 ) ) ,
sizeof ( val ) ) ;
return val ;
}
static inline void btrfs_set_dev_stats_value ( struct extent_buffer * eb ,
struct btrfs_dev_stats_item * ptr ,
int index , u64 val )
{
write_extent_buffer ( eb , & val ,
offsetof ( struct btrfs_dev_stats_item , values ) +
( ( unsigned long ) ptr ) + ( index * sizeof ( u64 ) ) ,
sizeof ( val ) ) ;
}
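In the two dev stats helpers, `ptr` is not dereferenced: it is cast to unsigned long and used as a byte offset into the extent buffer. The sketch below models that arithmetic on a flat byte array. `struct dev_stats_item_sketch`, `STATS_SLOTS_SKETCH`, and the `stats_*` functions are hypothetical, the slot count is arbitrary, and the sketch does no endian conversion.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STATS_SLOTS_SKETCH 5  /* hypothetical slot count for this sketch */

struct dev_stats_item_sketch {
	uint64_t values[STATS_SLOTS_SKETCH];
};

/* Byte offset of values[index] for an item that starts at item_off. */
static size_t stats_value_offset(size_t item_off, int index)
{
	return offsetof(struct dev_stats_item_sketch, values) +
	       item_off + (size_t)index * sizeof(uint64_t);
}

/* Copy out of the buffer instead of dereferencing: the item may be unaligned. */
static uint64_t stats_get(const uint8_t *buf, size_t item_off, int index)
{
	uint64_t val;

	memcpy(&val, buf + stats_value_offset(item_off, index), sizeof(val));
	return val;
}

static void stats_set(uint8_t *buf, size_t item_off, int index, uint64_t val)
{
	memcpy(buf + stats_value_offset(item_off, index), &val, sizeof(val));
}
```

Copying through a local `val` mirrors what read_extent_buffer() and write_extent_buffer() are used for in the original helpers: the item can start at any byte offset, so a direct `uint64_t` load is avoided.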
2011-09-13 13:06:07 +04:00
/* btrfs_qgroup_status_item */
BTRFS_SETGET_FUNCS ( qgroup_status_generation , struct btrfs_qgroup_status_item ,
generation , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_status_version , struct btrfs_qgroup_status_item ,
version , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_status_flags , struct btrfs_qgroup_status_item ,
flags , 64 ) ;
2013-04-25 20:04:51 +04:00
BTRFS_SETGET_FUNCS ( qgroup_status_rescan , struct btrfs_qgroup_status_item ,
rescan , 64 ) ;
2011-09-13 13:06:07 +04:00
/* btrfs_qgroup_info_item */
BTRFS_SETGET_FUNCS ( qgroup_info_generation , struct btrfs_qgroup_info_item ,
generation , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_info_rfer , struct btrfs_qgroup_info_item , rfer , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_info_rfer_cmpr , struct btrfs_qgroup_info_item ,
rfer_cmpr , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_info_excl , struct btrfs_qgroup_info_item , excl , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_info_excl_cmpr , struct btrfs_qgroup_info_item ,
excl_cmpr , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_qgroup_info_generation ,
struct btrfs_qgroup_info_item , generation , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_qgroup_info_rfer , struct btrfs_qgroup_info_item ,
rfer , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_qgroup_info_rfer_cmpr ,
struct btrfs_qgroup_info_item , rfer_cmpr , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_qgroup_info_excl , struct btrfs_qgroup_info_item ,
excl , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_qgroup_info_excl_cmpr ,
struct btrfs_qgroup_info_item , excl_cmpr , 64 ) ;
/* btrfs_qgroup_limit_item */
BTRFS_SETGET_FUNCS ( qgroup_limit_flags , struct btrfs_qgroup_limit_item ,
flags , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_limit_max_rfer , struct btrfs_qgroup_limit_item ,
max_rfer , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_limit_max_excl , struct btrfs_qgroup_limit_item ,
max_excl , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_limit_rsv_rfer , struct btrfs_qgroup_limit_item ,
rsv_rfer , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_limit_rsv_excl , struct btrfs_qgroup_limit_item ,
rsv_excl , 64 ) ;
2012-11-05 20:32:20 +04:00
/* btrfs_dev_replace_item */
BTRFS_SETGET_FUNCS ( dev_replace_src_devid ,
struct btrfs_dev_replace_item , src_devid , 64 ) ;
BTRFS_SETGET_FUNCS ( dev_replace_cont_reading_from_srcdev_mode ,
struct btrfs_dev_replace_item , cont_reading_from_srcdev_mode ,
64 ) ;
BTRFS_SETGET_FUNCS ( dev_replace_replace_state , struct btrfs_dev_replace_item ,
replace_state , 64 ) ;
BTRFS_SETGET_FUNCS ( dev_replace_time_started , struct btrfs_dev_replace_item ,
time_started , 64 ) ;
BTRFS_SETGET_FUNCS ( dev_replace_time_stopped , struct btrfs_dev_replace_item ,
time_stopped , 64 ) ;
BTRFS_SETGET_FUNCS ( dev_replace_num_write_errors , struct btrfs_dev_replace_item ,
num_write_errors , 64 ) ;
BTRFS_SETGET_FUNCS ( dev_replace_num_uncorrectable_read_errors ,
struct btrfs_dev_replace_item , num_uncorrectable_read_errors ,
64 ) ;
BTRFS_SETGET_FUNCS ( dev_replace_cursor_left , struct btrfs_dev_replace_item ,
cursor_left , 64 ) ;
BTRFS_SETGET_FUNCS ( dev_replace_cursor_right , struct btrfs_dev_replace_item ,
cursor_right , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dev_replace_src_devid ,
struct btrfs_dev_replace_item , src_devid , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dev_replace_cont_reading_from_srcdev_mode ,
struct btrfs_dev_replace_item ,
cont_reading_from_srcdev_mode , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dev_replace_replace_state ,
struct btrfs_dev_replace_item , replace_state , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dev_replace_time_started ,
struct btrfs_dev_replace_item , time_started , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dev_replace_time_stopped ,
struct btrfs_dev_replace_item , time_stopped , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dev_replace_num_write_errors ,
struct btrfs_dev_replace_item , num_write_errors , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dev_replace_num_uncorrectable_read_errors ,
struct btrfs_dev_replace_item ,
num_uncorrectable_read_errors , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dev_replace_cursor_left ,
struct btrfs_dev_replace_item , cursor_left , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dev_replace_cursor_right ,
struct btrfs_dev_replace_item , cursor_right , 64 ) ;
2007-03-14 17:31:29 +03:00
/* helper function to cast into the data area of the leaf. */
# define btrfs_item_ptr(leaf, slot, type) \
2017-05-29 09:43:43 +03:00
( ( type * ) ( BTRFS_LEAF_DATA_OFFSET + \
2007-10-16 00:14:19 +04:00
btrfs_item_offset_nr ( leaf , slot ) ) )
# define btrfs_item_ptr_offset(leaf, slot) \
2017-05-29 09:43:43 +03:00
( ( unsigned long ) ( BTRFS_LEAF_DATA_OFFSET + \
2007-10-16 00:14:19 +04:00
btrfs_item_offset_nr ( leaf , slot ) ) )
2007-03-14 17:31:29 +03:00
2010-09-17 00:19:09 +04:00
static inline bool btrfs_mixed_space_info ( struct btrfs_space_info * space_info )
{
return ( ( space_info->flags & BTRFS_BLOCK_GROUP_METADATA ) &&
( space_info->flags & BTRFS_BLOCK_GROUP_DATA ) ) ;
}
2011-09-21 23:05:58 +04:00
static inline gfp_t btrfs_alloc_write_mask ( struct address_space * mapping )
{
2015-11-07 03:28:49 +03:00
return mapping_gfp_constraint ( mapping , ~ __GFP_FS ) ;
2011-09-21 23:05:58 +04:00
}
2007-04-17 21:26:50 +04:00
/* extent-tree.c */
2015-02-04 17:59:29 +03:00
2017-08-19 00:15:18 +03:00
enum btrfs_inline_ref_type {
BTRFS_REF_TYPE_INVALID = 0 ,
BTRFS_REF_TYPE_BLOCK = 1 ,
BTRFS_REF_TYPE_DATA = 2 ,
BTRFS_REF_TYPE_ANY = 3 ,
} ;
int btrfs_get_extent_inline_ref_type ( const struct extent_buffer * eb ,
struct btrfs_extent_inline_ref * iref ,
enum btrfs_inline_ref_type is_data ) ;
2016-06-23 01:54:24 +03:00
u64 btrfs_csum_bytes_to_leaves ( struct btrfs_fs_info * fs_info , u64 csum_bytes ) ;
2015-02-04 17:59:29 +03:00
2016-06-16 18:07:27 +03:00
static inline u64 btrfs_calc_trans_metadata_size ( struct btrfs_fs_info * fs_info ,
2011-07-15 19:16:44 +04:00
unsigned num_items )
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which was spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which was reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which was reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which was reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comments.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason.
Changelog V1 -> V2:
- Break up the global rb-tree, use a list to manage the delayed nodes,
which are created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- Introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as the inode item, directory name item, directory name index and so on.
If we delay some of the b+ tree insertions and deletions, we can improve
performance, so this patch implements delayed directory name index
insertion/deletion and delayed inode update.
Implementation:
- Introduce a delayed root object into the filesystem that uses two lists to
manage the delayed nodes, which are created for every file/directory.
One list manages all the delayed nodes that have delayed items. The
other manages the delayed nodes that are waiting to be dealt with
by the worker thread.
- Every delayed node has two rb-trees: one manages the directory name
index items that are going to be inserted into the b+ tree, and the other
manages the directory name index items that are going to be deleted from
the b+ tree.
- Introduce a worker to deal with the delayed operations. This worker
handles the delayed directory name index insertions and deletions and
the delayed inode updates.
When the number of delayed items is beyond the lower limit, we create works
for some delayed nodes, insert them into the work queue of the worker, and
then go back.
When the number of delayed items is beyond the upper bound, we create works
for all the delayed nodes that haven't been dealt with, insert them into the
work queue of the worker, and then wait until the number of untreated items
is below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add
the information into the delayed inserting rb-tree.
Then we check the number of delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search
for it in the inserting rb-tree first. If we find it, we just drop it. If
not, we add its key into the delayed deleting rb-tree.
As with the delayed inserting rb-tree, we also check the number of
delayed items and do delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertions and deletions.
- We move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit the transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test with the benchmark tool[1] and found that this patch
improves the performance of file creation by ~15% and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 14:12:22 +04:00
{
Btrfs: fix delalloc accounting leak caused by u32 overflow
btrfs_calc_trans_metadata_size() does an unsigned 32-bit multiplication,
which can overflow if num_items >= 4 GB / (nodesize * BTRFS_MAX_LEVEL * 2).
For a nodesize of 16kB, this overflow happens at 16k items. Usually,
num_items is a small constant passed to btrfs_start_transaction(), but
we also use btrfs_calc_trans_metadata_size() for metadata reservations
for extent items in btrfs_delalloc_{reserve,release}_metadata().
In drop_outstanding_extents(), num_items is calculated as
inode->reserved_extents - inode->outstanding_extents. The difference
between these two counters is usually small, but if many delalloc
extents are reserved and then the outstanding extents are merged in
btrfs_merge_extent_hook(), the difference can become large enough to
overflow in btrfs_calc_trans_metadata_size().
The overflow manifests itself as a leak of a multiple of 4 GB in
delalloc_block_rsv and the metadata bytes_may_use counter. This in turn
can cause early ENOSPC errors. Additionally, these WARN_ONs in
extent-tree.c will be hit when unmounting:
WARN_ON(fs_info->delalloc_block_rsv.size > 0);
WARN_ON(fs_info->delalloc_block_rsv.reserved > 0);
WARN_ON(space_info->bytes_pinned > 0 ||
space_info->bytes_reserved > 0 ||
space_info->bytes_may_use > 0);
Fix it by casting nodesize to a u64 so that
btrfs_calc_trans_metadata_size() does a full 64-bit multiplication.
While we're here, do the same in btrfs_calc_trunc_metadata_size(); this
can't overflow with any existing uses, but it's better to be safe here
than have another hard-to-debug problem later on.
Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-06-02 11:20:01 +03:00
return ( u64 ) fs_info->nodesize * BTRFS_MAX_LEVEL * 2 * num_items ;
2011-08-19 18:29:59 +04:00
}
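The overflow fixed by the u64 cast can be reproduced in isolation. With a 16 KiB nodesize, each item reserves 16384 * 8 * 2 = 256 KiB, so 16384 items come to exactly 4 GiB, which wraps to 0 in 32 bits. The sketch below is a standalone illustration, not the kernel code: `trans_size_u32()`, `trans_size_u64()`, and `MAX_LEVEL_SKETCH` are hypothetical names, and the value 8 is assumed for the tree height limit.

```c
#include <stdint.h>

#define MAX_LEVEL_SKETCH 8u  /* assumed tree height limit for this sketch */

/* Pre-fix shape: the product is truncated to 32 bits, so it wraps. */
static uint64_t trans_size_u32(uint32_t nodesize, uint32_t num_items)
{
	uint32_t bytes = nodesize * MAX_LEVEL_SKETCH * 2u * num_items;

	return bytes;
}

/* Post-fix shape: widening the first operand makes the whole product 64-bit. */
static uint64_t trans_size_u64(uint32_t nodesize, uint32_t num_items)
{
	return (uint64_t)nodesize * MAX_LEVEL_SKETCH * 2u * num_items;
}
```

The cast has to be on the first operand: multiplication associates left to right, so widening `nodesize` promotes every later factor as well, whereas casting only the final result would widen a value that has already wrapped.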
/*
* Doing a truncate won't result in new nodes or leaves, just what we need for
* COW .
*/
2016-06-16 18:07:27 +03:00
static inline u64 btrfs_calc_trunc_metadata_size ( struct btrfs_fs_info * fs_info ,
2011-08-19 18:29:59 +04:00
unsigned num_items )
{
2017-06-02 11:20:01 +03:00
return ( u64 ) fs_info->nodesize * BTRFS_MAX_LEVEL * num_items ;
2011-04-22 14:12:22 +04:00
}
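The balance policy described in the "implement delayed inode items operation" commit message has two thresholds: past the lower limit some work is queued and the caller returns, and past the upper bound all pending work is queued and the caller waits. A minimal sketch of that decision follows; the enum, the function name, and the threshold parameters are hypothetical, and the commit message gives no concrete threshold values.

```c
/* Hypothetical outcome of one balance check. */
enum balance_action_sketch {
	BALANCE_NONE,        /* at or below the lower limit: nothing to do */
	BALANCE_ASYNC_SOME,  /* queue work for some nodes and return */
	BALANCE_ALL_WAIT,    /* queue work for all nodes, then wait */
};

static enum balance_action_sketch
pick_balance_action(unsigned long delayed_items,
		    unsigned long lower_limit,
		    unsigned long upper_bound)
{
	/* Check the stricter threshold first. */
	if (delayed_items > upper_bound)
		return BALANCE_ALL_WAIT;
	if (delayed_items > lower_limit)
		return BALANCE_ASYNC_SOME;
	return BALANCE_NONE;
}
```

The sketch only picks the action. Queueing the work and waiting for the untreated item count to drop below the threshold are left out.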
2013-06-12 21:56:06 +04:00
int btrfs_should_throttle_delayed_refs ( struct btrfs_trans_handle * trans ,
2016-06-23 01:54:24 +03:00
struct btrfs_fs_info * fs_info ) ;
2014-01-23 19:54:11 +04:00
int btrfs_check_space_for_delayed_refs ( struct btrfs_trans_handle * trans ,
2016-06-23 01:54:24 +03:00
struct btrfs_fs_info * fs_info ) ;
Btrfs: don't do unnecessary delalloc flushes when relocating
Before we start the actual relocation process of a block group, we do
calls to flush delalloc of all inodes and then wait for ordered extents
to complete. However we do these flush calls just to make sure we don't
race with concurrent tasks that have actually already started to run
delalloc and have allocated an extent from the block group we want to
relocate, right before we set it to readonly mode, but have not yet
created the respective ordered extents. The flush calls make us wait
for such concurrent tasks because they end up calling
filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() ->
__start_delalloc_inodes() -> btrfs_alloc_delalloc_work() ->
btrfs_run_delalloc_work()) which ends up serializing us with those tasks
due to attempts to lock the same pages (and the delalloc flush procedure
calls the allocator and creates the ordered extents before unlocking the
pages).
These flushing calls not only make us waste time (cpu, IO) but also reduce
the chances of writing larger extents (applications might be writing to
contiguous ranges and we flush before they finish dirtying the whole
ranges).
So make sure we don't flush delalloc and just wait for concurrent tasks
that have already started flushing delalloc and have allocated an extent
from the block group we are about to relocate.
This change also ends up fixing a race with direct IO writes that makes
relocation not wait for direct IO ordered extents. This race is
illustrated by the following diagram:
           CPU 1                                       CPU 2

 btrfs_relocate_block_group(bg X)

                                    starts direct IO write,
                                    target inode currently has no
                                    ordered extents ongoing nor
                                    dirty pages (delalloc regions),
                                    therefore the root for our inode
                                    is not in the list
                                    fs_info->ordered_roots

                                    btrfs_direct_IO()
                                      __blockdev_direct_IO()
                                        btrfs_get_blocks_direct()
                                          btrfs_lock_extent_direct()
                                            locks range in the io tree
                                          btrfs_new_extent_direct()
                                            btrfs_reserve_extent()
                                              --> extent allocated
                                                  from bg X

 btrfs_inc_block_group_ro(bg X)

 btrfs_start_delalloc_roots()
   __start_delalloc_inodes()
     --> does nothing, no delalloc ranges
         in the inode's io tree so the
         inode's root is not in the list
         fs_info->delalloc_roots

 btrfs_wait_ordered_roots()
   --> does not find the inode's root in the
       list fs_info->ordered_roots

   --> ends up not waiting for the direct IO
       write started by the task at CPU 2

 relocate_block_group(rc->stage ==
   MOVE_DATA_EXTENTS)

   prepare_to_relocate()
     btrfs_commit_transaction()

   iterates the extent tree, using its
   commit root and moves extents into new
   locations

                                    btrfs_add_ordered_extent_dio()
                                      --> now an ordered extent is
                                          created and added to the
                                          list root->ordered_extents
                                          and the root added to the
                                          list fs_info->ordered_roots
                                      --> this is too late and the
                                          task at CPU 1 already
                                          started the relocation

   btrfs_commit_transaction()

                                    btrfs_finish_ordered_io()
                                      btrfs_alloc_reserved_file_extent()
                                        --> adds delayed data reference
                                            for the extent allocated
                                            from bg X

 relocate_block_group(rc->stage ==
   UPDATE_DATA_PTRS)

   prepare_to_relocate()
     btrfs_commit_transaction()
       --> delayed refs are run, so an extent
           item for the allocated extent from
           bg X is added to extent tree
       --> commit roots are switched, so the
           next scan in the extent tree will
           see the extent item

   sees the extent in the extent tree
When this happens the relocation produces the following warning when it
finishes:
[ 7260.832836] ------------[ cut here ]------------
[ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]()
[ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
[ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1
[ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7260.852998] 0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000
[ 7260.852998] 0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d
[ 7260.852998] ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000
[ 7260.852998] Call Trace:
[ 7260.852998] [<ffffffff812648b3>] dump_stack+0x67/0x90
[ 7260.852998] [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2
[ 7260.852998] [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
[ 7260.852998] [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c
[ 7260.852998] [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
[ 7260.852998] [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs]
[ 7260.852998] [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs]
[ 7260.852998] [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19
[ 7260.852998] [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs]
[ 7260.852998] [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs]
[ 7260.852998] [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63
[ 7260.852998] [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44
[ 7260.852998] [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc
[ 7260.852998] [<ffffffff811876ab>] vfs_ioctl+0x18/0x34
[ 7260.852998] [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be
[ 7260.852998] [<ffffffff81190c30>] ? __fget_light+0x4d/0x71
[ 7260.852998] [<ffffffff81187d77>] SyS_ioctl+0x57/0x79
[ 7260.852998] [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7260.893268] ---[ end trace eb7803b24ebab8ad ]---
This is because at the end of the first stage, in relocate_block_group(),
we commit the current transaction, which makes delayed refs run, the
commit roots are switched and so the second stage will find the extent
item that the ordered extent added to the delayed refs. But this extent
was not moved (ordered extent completed after first stage finished), so
at the end of the relocation our block group item still has a positive
used bytes counter, triggering a warning at the end of
btrfs_relocate_block_group(). Later on when trying to read the extent
contents from disk we hit a BUG_ON() due to the inability to map a block
with a logical address that belongs to the block group we relocated and
is no longer valid, resulting in the following trace:
[ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096
[ 7344.887518] ------------[ cut here ]------------
[ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833!
[ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
[ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G W 4.5.0-rc6-btrfs-next-28+ #1
[ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000
[ 7344.888431] RIP: 0010:[<ffffffffa037c88c>] [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
[ 7344.888431] RSP: 0018:ffff8802046878f0 EFLAGS: 00010282
[ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001
[ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff
[ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000
[ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000
[ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0
[ 7344.888431] FS: 00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
[ 7344.888431] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0
[ 7344.888431] Stack:
[ 7344.888431] 0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950
[ 7344.888431] ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000
[ 7344.888431] ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000
[ 7344.888431] Call Trace:
[ 7344.888431] [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs]
[ 7344.888431] [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs]
[ 7344.888431] [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs]
[ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431] [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf
[ 7344.888431] [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs]
[ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431] [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs]
[ 7344.888431] [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10
[ 7344.888431] [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs]
[ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431] [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd
[ 7344.888431] [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs]
[ 7344.888431] [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc
[ 7344.888431] [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207
[ 7344.888431] [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207
[ 7344.888431] [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154
[ 7344.888431] [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f
[ 7344.888431] [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1
[ 7344.888431] [<ffffffff8117773a>] __vfs_read+0x79/0x9d
[ 7344.888431] [<ffffffff81178050>] vfs_read+0x8f/0xd2
[ 7344.888431] [<ffffffff81178a38>] SyS_read+0x50/0x7e
[ 7344.888431] [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0
[ 7344.888431] RIP [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
[ 7344.888431] RSP <ffff8802046878f0>
[ 7344.970544] ---[ end trace eb7803b24ebab8ae ]---
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2016-04-26 17:39:32 +03:00
void btrfs_dec_block_group_reservations ( struct btrfs_fs_info * fs_info ,
const u64 start ) ;
void btrfs_wait_block_group_reservations ( struct btrfs_block_group_cache * bg ) ;
2016-05-09 15:15:41 +03:00
bool btrfs_inc_nocow_writers ( struct btrfs_fs_info * fs_info , u64 bytenr ) ;
void btrfs_dec_nocow_writers ( struct btrfs_fs_info * fs_info , u64 bytenr ) ;
void btrfs_wait_nocow_writers ( struct btrfs_block_group_cache * bg ) ;
2009-04-03 17:47:43 +04:00
void btrfs_put_block_group ( struct btrfs_block_group_cache * cache ) ;
2009-03-13 17:10:06 +03:00
int btrfs_run_delayed_refs ( struct btrfs_trans_handle * trans ,
2016-06-23 01:54:24 +03:00
struct btrfs_fs_info * fs_info , unsigned long count ) ;
int btrfs_async_run_delayed_refs ( struct btrfs_fs_info * fs_info ,
2016-04-12 00:37:40 +03:00
unsigned long count , u64 transid , int wait ) ;
2016-06-23 01:54:24 +03:00
int btrfs_lookup_data_extent ( struct btrfs_fs_info * fs_info , u64 start , u64 len ) ;
2010-05-16 18:48:46 +04:00
int btrfs_lookup_extent_info ( struct btrfs_trans_handle * trans ,
2016-06-23 01:54:24 +03:00
struct btrfs_fs_info * fs_info , u64 bytenr ,
2013-03-07 23:22:04 +04:00
u64 offset , int metadata , u64 * refs , u64 * flags ) ;
2016-06-23 01:54:24 +03:00
int btrfs_pin_extent ( struct btrfs_fs_info * fs_info ,
2009-09-12 00:11:19 +04:00
u64 bytenr , u64 num , int reserved ) ;
2016-06-23 01:54:24 +03:00
int btrfs_pin_extent_for_log_replay ( struct btrfs_fs_info * fs_info ,
2011-11-01 04:52:39 +04:00
u64 bytenr , u64 num_bytes ) ;
2016-06-23 01:54:24 +03:00
int btrfs_exclude_logged_extents ( struct btrfs_fs_info * fs_info ,
2013-06-06 21:19:32 +04:00
struct extent_buffer * eb ) ;
2017-01-30 23:25:28 +03:00
int btrfs_cross_ref_exist ( struct btrfs_root * root ,
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in a subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
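The two COW cases above can be sketched in C as follows. This is only an
illustration of the reference count rules as the text states them, not the
kernel code: struct tree_block_sketch, sketch_cow_block() and the fixed-size
pointer array are hypothetical names made up for this example, and reference
counts are modelled as plain integers instead of extent tree items.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_PTRS 4

/* Hypothetical, simplified tree block: not a real btrfs structure. */
struct tree_block_sketch {
	unsigned int refs;		/* reference count of this block */
	bool freed;			/* set once the block has been freed */
	size_t nr_ptrs;			/* number of extents pointed to */
	unsigned int *child_refs[SKETCH_MAX_PTRS]; /* refcounts of those extents */
};

/* Model "cow old_blk into new_blk" using the rules from the text above. */
static void sketch_cow_block(struct tree_block_sketch *old_blk,
			     struct tree_block_sketch *new_blk)
{
	size_t i;

	assert(old_blk->refs >= 1 && !old_blk->freed);
	assert(old_blk->nr_ptrs <= SKETCH_MAX_PTRS);

	/* The new block starts as a copy of the old block's pointers. */
	new_blk->refs = 1;
	new_blk->freed = false;
	new_blk->nr_ptrs = old_blk->nr_ptrs;
	for (i = 0; i < old_blk->nr_ptrs; i++)
		new_blk->child_refs[i] = old_blk->child_refs[i];

	if (old_blk->refs == 1) {
		/*
		 * Non-shared block: free the old block at once. The new block
		 * inherits its references, so lower level reference counts
		 * are left untouched.
		 */
		old_blk->refs = 0;
		old_blk->freed = true;
		return;
	}

	/*
	 * Shared block: every extent the new block points to gains one
	 * reference, and the old block loses one.
	 */
	for (i = 0; i < new_blk->nr_ptrs; i++)
		(*new_blk->child_refs[i])++;
	old_blk->refs--;
}
```

In the non-shared case no lower level reference count changes, which is the
saving the commit message describes.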
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 18:45:14 +04:00
u64 objectid , u64 offset , u64 bytenr ) ;
2009-01-06 05:25:51 +03:00
struct btrfs_block_group_cache * btrfs_lookup_block_group (
struct btrfs_fs_info * info ,
u64 bytenr ) ;
2015-11-19 14:45:48 +03:00
void btrfs_get_block_group ( struct btrfs_block_group_cache * cache ) ;
2009-06-10 18:45:14 +04:00
void btrfs_put_block_group ( struct btrfs_block_group_cache * cache ) ;
2013-11-01 21:07:04 +04:00
int get_block_group_index ( struct btrfs_block_group_cache * cache ) ;
2014-06-15 03:54:12 +04:00
struct extent_buffer * btrfs_alloc_tree_block ( struct btrfs_trans_handle * trans ,
2017-01-18 10:24:37 +03:00
struct btrfs_root * root ,
u64 parent , u64 root_objectid ,
const struct btrfs_disk_key * key ,
int level , u64 hint ,
u64 empty_size ) ;
2010-05-16 18:46:25 +04:00
void btrfs_free_tree_block ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct extent_buffer * buf ,
2012-05-16 19:04:52 +04:00
u64 parent , int last_ref ) ;
2009-06-10 18:45:14 +04:00
int btrfs_alloc_reserved_file_extent ( struct btrfs_trans_handle * trans ,
2017-09-29 22:43:49 +03:00
struct btrfs_root * root , u64 owner ,
2015-10-26 09:11:18 +03:00
u64 offset , u64 ram_bytes ,
struct btrfs_key * ins ) ;
2009-06-10 18:45:14 +04:00
int btrfs_alloc_logged_file_extent ( struct btrfs_trans_handle * trans ,
2016-06-23 01:54:24 +03:00
struct btrfs_fs_info * fs_info ,
2009-06-10 18:45:14 +04:00
u64 root_objectid , u64 owner , u64 offset ,
struct btrfs_key * ins ) ;
btrfs: update btrfs_space_info's bytes_may_use timely
This patch fixes some false ENOSPC errors. The test script below
reproduces one of them:
#!/bin/bash
dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
dev=$(losetup --show -f fs.img)
mkfs.btrfs -f -M $dev
mkdir /tmp/mntpoint
mount $dev /tmp/mntpoint
cd /tmp/mntpoint
xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile
The above script fails with ENOSPC, although the fs still has enough free
space to satisfy this request. Please see the call graph:
btrfs_fallocate()
|-> btrfs_alloc_data_chunk_ondemand()
| bytes_may_use += 64M
|-> btrfs_prealloc_file_range()
|-> btrfs_reserve_extent()
|-> btrfs_add_reserved_bytes()
| alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
| change bytes_may_use, and bytes_reserved += 64M. Now
| bytes_may_use + bytes_reserved == 128M, which is greater
| than btrfs_space_info's total_bytes, false enospc occurs.
| Note, the bytes_may_use decrease operation will be done in
| end of btrfs_fallocate(), which is too late.
Here is another simple case for buffered write:
CPU 1 | CPU 2
|
|-> cow_file_range() |-> __btrfs_buffered_write()
|-> btrfs_reserve_extent() | |
| | |
| | |
| ..... | |-> btrfs_check_data_free_space()
| |
| |
|-> extent_clear_unlock_delalloc() |
In CPU 1, btrfs_reserve_extent()->find_free_extent()->
btrfs_add_reserved_bytes() do not decrease bytes_may_use, the decrease
operation will be delayed to be done in extent_clear_unlock_delalloc().
Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's
btrfs_check_data_free_space() tries to reserve 100MB data space.
If
100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
btrfs_check_data_free_space() will try to allocate a new data chunk, call
btrfs_start_delalloc_roots(), or commit the current transaction in order to
reserve some free space, which is obviously a lot of work. None of it is
necessary: if bytes_may_use had been decreased in time (by the 128M that
CPU 1 already reserved), there would still be enough free space.
To fix this issue, this patch updates bytes_may_use for both data and
metadata in btrfs_add_reserved_bytes(). On the compression path the real
extent length may differ from the file content length, and bytes_may_use is
increased by the file content length, so a ram_bytes argument is introduced
for btrfs_reserve_extent(), find_free_extent() and
btrfs_add_reserved_bytes(). The compression path can then update
bytes_may_use correctly. RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC and
RESERVE_FREE can now be discarded.
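The accounting problem and its fix can be illustrated with a small C model.
This is a sketch under stated assumptions, not the kernel implementation:
struct space_info_sketch and every sketch_* function are hypothetical names
invented here, only the counters named in this message are modelled, and the
128M total is an illustrative figure rather than a value taken from a real
filesystem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for btrfs_space_info with only the counters above. */
struct space_info_sketch {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_reserved;
	uint64_t bytes_pinned;
	uint64_t bytes_readonly;
	uint64_t bytes_may_use;
};

/* Free space as computed by the inequality in the message, clamped at 0. */
static uint64_t sketch_free_bytes(const struct space_info_sketch *s)
{
	uint64_t committed = s->bytes_used + s->bytes_reserved +
			     s->bytes_pinned + s->bytes_readonly +
			     s->bytes_may_use;

	return committed >= s->total_bytes ? 0 : s->total_bytes - committed;
}

/* Model of the up-front data space check: account the bytes as may_use. */
static bool sketch_check_data_free_space(struct space_info_sketch *s,
					 uint64_t bytes)
{
	if (bytes > sketch_free_bytes(s))
		return false;	/* the caller would see ENOSPC */
	s->bytes_may_use += bytes;
	return true;
}

/* Old behaviour: the reservation is counted while may_use is left alone. */
static void sketch_add_reserved_old(struct space_info_sketch *s,
				    uint64_t num_bytes)
{
	s->bytes_reserved += num_bytes;
}

/* Patched behaviour: move ram_bytes out of may_use at reservation time. */
static void sketch_add_reserved_new(struct space_info_sketch *s,
				    uint64_t ram_bytes, uint64_t num_bytes)
{
	assert(s->bytes_may_use >= ram_bytes);
	s->bytes_reserved += num_bytes;
	s->bytes_may_use -= ram_bytes;
}
```

With the old helper, a 64M request is counted twice (once in bytes_may_use
and once in bytes_reserved) until the late decrement happens. With the new
helper, the same 64M is counted once. ram_bytes and num_bytes are separate
parameters because they can differ on the compression path.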
As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In
run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as
PREALLOC, we also need to update bytes_may_use, but can not pass
EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so
here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook()
to update btrfs_space_info's bytes_may_use.
Meanwhile, __btrfs_prealloc_file_range() now calls
btrfs_free_reserved_data_space() internally on both the successful and the
failed path, so callers of btrfs_prealloc_file_range() no longer need to call
btrfs_free_reserved_data_space().
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-07-25 10:51:40 +03:00
int btrfs_reserve_extent ( struct btrfs_root * root , u64 ram_bytes , u64 num_bytes ,
2013-08-14 22:02:47 +04:00
u64 min_alloc_size , u64 empty_size , u64 hint_byte ,
Btrfs: fix broken free space cache after the system crashed
When we mounted the filesystem after the crash, we got the following
message:
BTRFS error (device xxx): block group xxxx has wrong amount of free space
BTRFS error (device xxx): failed to load free space cache for block group xxx
This happened because we did not update the metadata of the allocated space
(in the extent tree) until the file data had been written to disk. During this
time, there was no information about the allocated space in either the extent
tree or the free space cache. When we wrote out the free space cache at this
point (transaction commit), that space was lost. In fact, only free space
used to store file data had this problem; other allocations did not, because
their metadata is updated in the same transaction context.
There are several ways to fix the above problem:
- track the allocated space, and write it out when we write out the free
space cache
- account the size of the allocated space that is used to store the file
data, if the size is not zero, don't write out the free space cache.
The first one is complex and may hurt performance.
This patch chose the second method: we use a per-block-group variable to
account for the size of that allocated space. Besides that, we also introduce
a per-block-group read-write semaphore to avoid the race between
allocation and the free space cache write-out.
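The counter half of the chosen method can be sketched in C as below. This is
a minimal illustration only: struct block_group_sketch, its delalloc_bytes
field and the sketch_* helpers are hypothetical names made up for this
example, and the read-write semaphore that serialises allocation against
cache write-out is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-block-group state: not the real kernel structure. */
struct block_group_sketch {
	/* Bytes allocated for file data but not yet recorded in the extent tree. */
	uint64_t delalloc_bytes;
};

/* Space was handed out for file data; its metadata is not updated yet. */
static void sketch_alloc_for_data(struct block_group_sketch *bg, uint64_t bytes)
{
	bg->delalloc_bytes += bytes;
}

/* The extent tree now describes the allocation, so stop tracking it. */
static void sketch_extent_recorded(struct block_group_sketch *bg, uint64_t bytes)
{
	assert(bg->delalloc_bytes >= bytes);
	bg->delalloc_bytes -= bytes;
}

/* Only write the free space cache when no such allocation is outstanding. */
static bool sketch_can_write_free_space_cache(const struct block_group_sketch *bg)
{
	return bg->delalloc_bytes == 0;
}
```

Skipping the write-out while the counter is non-zero means the cache never
reaches disk in a state that omits space already handed out for file data.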
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 06:42:50 +04:00
struct btrfs_key * ins , int is_data , int delalloc ) ;
2007-03-16 23:20:31 +03:00
int btrfs_inc_ref ( struct btrfs_trans_handle * trans , struct btrfs_root * root ,
2014-07-02 21:54:25 +04:00
struct extent_buffer * buf , int full_backref ) ;
2009-06-10 18:45:14 +04:00
int btrfs_dec_ref ( struct btrfs_trans_handle * trans , struct btrfs_root * root ,
2014-07-02 21:54:25 +04:00
struct extent_buffer * buf , int full_backref ) ;
2009-06-10 18:45:14 +04:00
int btrfs_set_disk_extent_flags ( struct btrfs_trans_handle * trans ,
2016-06-23 01:54:24 +03:00
struct btrfs_fs_info * fs_info ,
2009-06-10 18:45:14 +04:00
u64 bytenr , u64 num_bytes , u64 flags ,
2013-05-09 21:49:30 +04:00
int level , int is_data ) ;
2008-09-23 21:14:14 +04:00
int btrfs_free_extent ( struct btrfs_trans_handle * trans ,
2017-09-29 22:43:49 +03:00
struct btrfs_root * root ,
2011-09-12 17:26:38 +04:00
u64 bytenr , u64 num_bytes , u64 parent , u64 root_objectid ,
Btrfs: fix regression running delayed references when using qgroups
In the kernel 4.2 merge window we had big changes to the implementation
of delayed references and qgroups which made the no_quota field of delayed
references not used anymore. More specifically the no_quota field is not
used anymore as of:
commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
Leaving the no_quota field actually prevents delayed references from
getting merged, which in turn causes the following BUG_ON(), at
fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
static int run_delayed_tree_ref(...)
{
(...)
BUG_ON(node->ref_mod != 1);
(...)
}
This happens on a scenario like the following:
1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
It's not merged with the reference at the tail of the list of refs
for bytenr X because the reference at the tail, Ref2 is incompatible
due to Ref2->no_quota != Ref3->no_quota.
4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
It's not merged with the reference at the tail of the list of refs
for bytenr X because the reference at the tail, Ref3 is incompatible
due to Ref3->no_quota != Ref4->no_quota.
5) We run delayed references, trigger merging of delayed references,
through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
all other conditions are satisfied too. So Ref1 gets a ref_mod
value of 2.
7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
all other conditions are satisfied too. So Ref2 gets a ref_mod
value of 2.
8) Ref1 and Ref2 aren't merged, because they have different values
for their no_quota field.
9) Delayed reference Ref1 is picked for running (select_delayed_ref()
always prefers references with an action == BTRFS_ADD_DELAYED_REF).
So run_delayed_tree_ref() is called for Ref1 which triggers the
BUG_ON because Ref1->ref_mod != 1 (equals 2).
So fix this by removing the no_quota field, as it's not used anymore as
of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented
qgroup mechanism.").
The use of no_quota was also buggy in at least two places:
1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
no_quota to 0 instead of 1 when the following condition was true:
is_fstree(ref_root) || !fs_info->quota_enabled
2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
reset a node's no_quota when the condition "!is_fstree(root_objectid)
|| !root->fs_info->quota_enabled" was true but we did it only in
an unused local stack variable, that is, we never reset the no_quota
value in the node itself.
This fixes the remainder of problems several people have been having when
running delayed references, mostly while a balance is running in parallel,
on a 4.2+ kernel.
Very special thanks to Stéphane Lesimple for helping debug this issue
and testing this fix on his multi terabyte filesystem (which took more
than one day to balance alone, plus fsck, etc).
Also, this fixes a deadlock issue when using the clone ioctl with qgroups
enabled, as reported by Elias Probst in the mailing list. The deadlock
happens because after calling btrfs_insert_empty_item we have our path
holding a write lock on a leaf of the fs/subvol tree and then before
releasing the path we called check_ref() which did backref walking, when
qgroups are enabled, and tried to read lock the same leaf. The trace for
this case is the following:
INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
(...)
Call Trace:
[<ffffffff86999201>] schedule+0x74/0x83
[<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
[<ffffffff86137ed7>] ? wait_woken+0x74/0x74
[<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
[<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
[<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
[<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
[<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
[<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
[<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
[<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
[<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
[<ffffffff863e852e>] check_ref+0x64/0xc4
[<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
[<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
[<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
[<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
(...)
The problem goes away by eliminating check_ref(), which is no longer
needed as its purpose was to get a value for the no_quota field of
a delayed reference (this patch removes the no_quota field as mentioned
earlier).
Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Reported-by: Elias Probst <mail@eliasprobst.eu>
Reported-by: Peter Becker <floyd.net@gmail.com>
Reported-by: Malte Schröder <malte@tnxip.de>
Reported-by: Derek Dongray <derek@valedon.co.uk>
Reported-by: Erkki Seppala <flux-btrfs@inside.org>
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
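The nine-step scenario above can be replayed with a small stand-alone model. This is only a sketch under stated assumptions: struct toy_ref, toy_can_merge() and toy_merge() are hypothetical stand-ins, not the kernel's delayed ref structures, and the merge rule is reduced to "same bytenr, optionally same no_quota". It shows that comparing no_quota leaves an ADD ref with ref_mod == 2, while ignoring the field lets the four refs cancel out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model, not the kernel's delayed ref code. */
enum toy_action { TOY_ADD_REF, TOY_DROP_REF };

struct toy_ref {
	enum toy_action action;
	int no_quota;
	int ref_mod;	/* 0 means the ref was merged away */
};

/* All refs target the same bytenr; only no_quota can block a merge. */
static bool toy_can_merge(const struct toy_ref *a, const struct toy_ref *b,
			  bool compare_no_quota)
{
	if (a->ref_mod == 0 || b->ref_mod == 0)
		return false;
	return !compare_no_quota || a->no_quota == b->no_quota;
}

/* Fold b into a: same action adds up, opposite actions cancel. */
static void toy_merge(struct toy_ref *a, struct toy_ref *b)
{
	assert(a != b);
	if (a->action == b->action) {
		a->ref_mod += b->ref_mod;
	} else if (a->ref_mod >= b->ref_mod) {
		a->ref_mod -= b->ref_mod;
	} else {
		a->ref_mod = b->ref_mod - a->ref_mod;
		a->action = b->action;
	}
	b->ref_mod = 0;
}

/* Replay Ref1..Ref4 and return the largest ref_mod left on an ADD ref. */
int toy_max_add_ref_mod(bool compare_no_quota)
{
	struct toy_ref refs[] = {
		{ TOY_ADD_REF,  1, 1 },	/* Ref1 */
		{ TOY_DROP_REF, 0, 1 },	/* Ref2 */
		{ TOY_ADD_REF,  1, 1 },	/* Ref3 */
		{ TOY_DROP_REF, 0, 1 },	/* Ref4 */
	};
	size_t n = sizeof(refs) / sizeof(refs[0]);
	int max = 0;

	for (size_t i = 0; i < n; i++)
		for (size_t j = i + 1; j < n; j++)
			if (toy_can_merge(&refs[i], &refs[j], compare_no_quota))
				toy_merge(&refs[i], &refs[j]);

	for (size_t i = 0; i < n; i++)
		if (refs[i].action == TOY_ADD_REF && refs[i].ref_mod > max)
			max = refs[i].ref_mod;
	return max;
}
```

Calling toy_max_add_ref_mod(true) models the buggy behaviour (an ADD ref with ref_mod 2, which would trip the BUG_ON), and toy_max_add_ref_mod(false) models the behaviour once the field is gone.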
2015-10-23 09:52:54 +03:00
u64 owner , u64 offset ) ;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
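The reference counting rule described above (non-shared block: free the old block and inherit its references; shared block: bump the children and drop one reference on the old block) can be illustrated with a minimal sketch. struct toy_block and toy_cow() are hypothetical names for illustration only and do not correspond to the real extent tree code; back ref updates are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PTRS 4

/* Hypothetical in-memory stand-in for a tree block. */
struct toy_block {
	int refs;				/* reference count of this block */
	size_t nr_ptrs;				/* number of child pointers in use */
	struct toy_block *ptrs[TOY_MAX_PTRS];	/* extents this block points to */
	int freed;				/* set once the block is released */
};

/* COW 'old' into 'copy' following the rule from the commit message. */
void toy_cow(struct toy_block *old, struct toy_block *copy)
{
	assert(old->nr_ptrs <= TOY_MAX_PTRS);

	copy->refs = 1;
	copy->freed = 0;
	copy->nr_ptrs = old->nr_ptrs;
	for (size_t i = 0; i < old->nr_ptrs; i++)
		copy->ptrs[i] = old->ptrs[i];

	if (old->refs > 1) {
		/* Shared block: children gain a reference, old loses one. */
		for (size_t i = 0; i < copy->nr_ptrs; i++)
			copy->ptrs[i]->refs++;
		old->refs--;
	} else {
		/*
		 * Non-shared block: the copy inherits the old block's
		 * references, so child counts stay untouched and the old
		 * block is freed at once.
		 */
		old->refs = 0;
		old->freed = 1;
	}
}
```

In the non-shared case no lower level reference count is modified, which is the saving the commit message attributes to the dead tree avoidance code.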
2009-06-10 18:45:14 +04:00
2016-06-23 01:54:24 +03:00
int btrfs_free_reserved_extent ( struct btrfs_fs_info * fs_info ,
u64 start , u64 len , int delalloc ) ;
int btrfs_free_and_pin_reserved_extent ( struct btrfs_fs_info * fs_info ,
2011-11-01 04:52:39 +04:00
u64 start , u64 len ) ;
2017-02-10 21:20:56 +03:00
void btrfs_prepare_extent_commit ( struct btrfs_fs_info * fs_info ) ;
2007-06-28 23:57:36 +04:00
int btrfs_finish_extent_commit ( struct btrfs_trans_handle * trans ,
2016-06-23 01:54:24 +03:00
struct btrfs_fs_info * fs_info ) ;
2007-04-17 21:26:50 +04:00
int btrfs_inc_extent_ref ( struct btrfs_trans_handle * trans ,
2017-09-29 22:43:49 +03:00
struct btrfs_root * root ,
2008-09-23 21:14:14 +04:00
u64 bytenr , u64 num_bytes , u64 parent ,
2015-10-23 09:52:54 +03:00
u64 root_objectid , u64 owner , u64 offset ) ;
2009-06-10 18:45:14 +04:00
2015-04-06 22:46:08 +03:00
int btrfs_start_dirty_block_groups ( struct btrfs_trans_handle * trans ,
2016-06-23 01:54:24 +03:00
struct btrfs_fs_info * fs_info ) ;
2007-04-27 00:46:15 +04:00
int btrfs_write_dirty_block_groups ( struct btrfs_trans_handle * trans ,
2016-06-23 01:54:24 +03:00
struct btrfs_fs_info * fs_info ) ;
2015-03-03 00:37:31 +03:00
int btrfs_setup_space_cache ( struct btrfs_trans_handle * trans ,
2016-06-23 01:54:24 +03:00
struct btrfs_fs_info * fs_info ) ;
int btrfs_extent_readonly ( struct btrfs_fs_info * fs_info , u64 bytenr ) ;
2007-04-27 00:46:15 +04:00
int btrfs_free_block_groups ( struct btrfs_fs_info * info ) ;
2016-06-21 17:40:19 +03:00
int btrfs_read_block_groups ( struct btrfs_fs_info * info ) ;
2016-06-22 04:16:51 +03:00
int btrfs_can_relocate ( struct btrfs_fs_info * fs_info , u64 bytenr ) ;
2008-03-24 22:01:56 +03:00
int btrfs_make_block_group ( struct btrfs_trans_handle * trans ,
2016-06-23 01:54:24 +03:00
struct btrfs_fs_info * fs_info , u64 bytes_used ,
2017-07-27 14:22:11 +03:00
u64 type , u64 chunk_offset , u64 size ) ;
2015-11-14 02:57:16 +03:00
struct btrfs_trans_handle * btrfs_start_trans_remove_block_group (
Btrfs: fix the number of transaction units needed to remove a block group
We were using only 1 transaction unit when attempting to delete an unused
block group but in reality we need 3 + N units, where N corresponds to the
number of stripes. We were accounting only for the addition of the orphan
item (for the block group's free space cache inode) but we were not
accounting that we need to delete one block group item from the extent
tree, one free space item from the tree of tree roots and N device extent
items from the device tree.
While one unit is not enough, it worked most of the time because for each
single unit we are too pessimistic and assume an entire tree path, with
the highest possible height (8), needs to be COWed with eventual node
splits at every possible level in the tree, so there was usually enough
reserved space for removing all the items and adding the orphan item.
However after adding the orphan item, writepages() can be called by the VM
subsystem against the btree inode when we are under memory pressure, which
causes writeback to start for the nodes we COWed before, this forces the
operation to remove the free space item to COW again some (or all of) the
same nodes (in the tree of tree roots). Even without writepages() being
called, we could fail with ENOSPC because these items are located in
multiple trees and one of them might have a greater height and require
node/leaf splits at many levels, exhausting all the reserved space before
removing all the items and adding the orphan.
In the kernel 4.0 release, commit 3d84be799194 ("Btrfs: fix BUG_ON in
btrfs_orphan_add() when delete unused block group"), we attempted to fix
a BUG_ON due to ENOSPC when trying to add the orphan item by making the
cleaner kthread reserve one transaction unit before attempting to remove
the block group, but this was not enough. We had a couple user reports
still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on
a 4.2-rc6 kernel for example:
http://www.spinics.net/lists/linux-btrfs/msg46070.html
So fix this by reserving all the necessary units of metadata.
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Fixes: 3d84be799194 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
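The "3 + N" figure above is simple arithmetic and can be spelled out as a small helper. toy_remove_bg_units() is a hypothetical name used only for this sketch; the real reservation is done through the transaction API.

```c
#include <assert.h>

/*
 * Hypothetical helper: count the transaction units named in the commit
 * message for removing one block group.
 *   1 unit  - add the orphan item for the free space cache inode
 *   1 unit  - delete the block group item from the extent tree
 *   1 unit  - delete the free space item from the tree of tree roots
 *   N units - delete one device extent item per stripe
 */
unsigned int toy_remove_bg_units(unsigned int num_stripes)
{
	assert(num_stripes > 0);
	return 3 + num_stripes;
}
```

For example, a block group backed by a two-stripe chunk needs 5 units under this count, compared with the single unit reserved before the fix.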
2015-11-14 02:57:17 +03:00
struct btrfs_fs_info * fs_info ,
const u64 chunk_offset ) ;
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keep the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a separate inode to hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
the same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
the same root key objectid, doing special handling for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 18:09:34 +04:00
int btrfs_remove_block_group ( struct btrfs_trans_handle * trans ,
2016-06-22 04:16:51 +03:00
struct btrfs_fs_info * fs_info , u64 group_start ,
Btrfs: fix race between fs trimming and block group remove/allocation
Our fs trim operation, which is completely transactionless (doesn't start
or join an existing transaction) consists of visiting all block groups
and then, for each one, iterating its free space entries and performing a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.
If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.
So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.
If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:
checking extents
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
owner ref check failed [833912832 16384]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
root 5 root dir 256 error
root 5 inode 260 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 262 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 263 errors 2001, no inode item, link count wrong
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
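The fix described above (keep the mapping alive until the last in-flight discard finishes) boils down to a counter plus a "removed" flag. The following is a single-threaded sketch with hypothetical names (struct toy_block_group, toy_get_trimming(), toy_put_trimming(), toy_remove()); it assumes the real code protects this state with a lock, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state relevant to the trim/remove race. */
struct toy_block_group {
	int trimming;		/* discards currently in flight */
	bool removed;		/* block group removal has happened */
	bool mapping_live;	/* extent map still pins the physical range */
};

/* A trimmer must take a reference before hiding a free space entry. */
bool toy_get_trimming(struct toy_block_group *bg)
{
	if (bg->removed)
		return false;
	bg->trimming++;
	return true;
}

/* Removal drops the mapping only if no discard is in flight. */
void toy_remove(struct toy_block_group *bg)
{
	bg->removed = true;
	if (bg->trimming == 0)
		bg->mapping_live = false;
}

/* The last trimmer to leave a removed block group drops the mapping. */
void toy_put_trimming(struct toy_block_group *bg)
{
	assert(bg->trimming > 0);
	bg->trimming--;
	if (bg->trimming == 0 && bg->removed)
		bg->mapping_live = false;
}
```

As long as mapping_live is true in this model, the physical range cannot be handed to a new block group, so a late discard cannot zero freshly written metadata.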
2014-11-28 00:14:15 +03:00
struct extent_map * em ) ;
2014-09-18 19:20:02 +04:00
void btrfs_delete_unused_bgs ( struct btrfs_fs_info * fs_info ) ;
2015-06-15 16:41:19 +03:00
void btrfs_get_block_group_trimming ( struct btrfs_block_group_cache * cache ) ;
void btrfs_put_block_group_trimming ( struct btrfs_block_group_cache * cache ) ;
2012-09-12 00:57:25 +04:00
void btrfs_create_pending_block_groups ( struct btrfs_trans_handle * trans ,
2016-06-23 01:54:24 +03:00
struct btrfs_fs_info * fs_info ) ;
2017-05-17 18:38:35 +03:00
u64 btrfs_data_alloc_profile ( struct btrfs_fs_info * fs_info ) ;
u64 btrfs_metadata_alloc_profile ( struct btrfs_fs_info * fs_info ) ;
u64 btrfs_system_alloc_profile ( struct btrfs_fs_info * fs_info ) ;
2009-03-10 19:39:20 +03:00
void btrfs_clear_space_info_full ( struct btrfs_fs_info * info ) ;
Btrfs: improve the noflush reservation
In some places (such as when evicting an inode), we cannot flush the reserved
space of delalloc, but flushing the delayed directory index and delayed inode
is OK. Currently we don't try to flush those either and just return when there is
not enough space to reserve. This patch fixes this problem.
We define 3 types of flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we are in a transaction, we should not flush anything, or a deadlock
would happen, so use NO_FLUSH. If flushing the reserved space of delalloc
would cause a deadlock, use FLUSH_LIMIT. In all other cases, use FLUSH_ALL,
which flushes everything.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
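The choice between the three flush types can be written down as a tiny decision function. This is a sketch only: the toy_ names and the two boolean inputs are assumptions made for illustration, and real callers pick the mode statically at each call site rather than computing it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the three reservation flush types. */
enum toy_reserve_flush {
	TOY_RESERVE_NO_FLUSH,
	TOY_RESERVE_FLUSH_LIMIT,
	TOY_RESERVE_FLUSH_ALL,
};

/* Pick a flush type following the rules in the commit message. */
enum toy_reserve_flush toy_pick_flush(bool in_transaction,
				      bool delalloc_flush_may_deadlock)
{
	if (in_transaction)
		return TOY_RESERVE_NO_FLUSH;
	if (delalloc_flush_may_deadlock)
		return TOY_RESERVE_FLUSH_LIMIT;
	return TOY_RESERVE_FLUSH_ALL;
}

/* Delayed items may be flushed in every mode except NO_FLUSH. */
bool toy_may_flush_delayed_items(enum toy_reserve_flush mode)
{
	switch (mode) {
	case TOY_RESERVE_NO_FLUSH:
		return false;
	case TOY_RESERVE_FLUSH_LIMIT:
	case TOY_RESERVE_FLUSH_ALL:
		return true;
	}
	assert(0 && "unknown flush mode");
	return false;
}

/* Delalloc space may only be flushed with FLUSH_ALL. */
bool toy_may_flush_delalloc(enum toy_reserve_flush mode)
{
	return mode == TOY_RESERVE_FLUSH_ALL;
}
```

The inode eviction case from the message maps to TOY_RESERVE_FLUSH_LIMIT here: delayed items can still be flushed, delalloc cannot.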
2012-10-16 15:33:38 +04:00
enum btrfs_reserve_flush_enum {
/* If we are in the transaction, we can't flush anything.*/
BTRFS_RESERVE_NO_FLUSH ,
/*
* Flushing delalloc may cause deadlock somewhere , in this
* case , use FLUSH LIMIT
*/
BTRFS_RESERVE_FLUSH_LIMIT ,
BTRFS_RESERVE_FLUSH_ALL ,
} ;
2016-03-25 20:25:56 +03:00
enum btrfs_flush_state {
FLUSH_DELAYED_ITEMS_NR = 1 ,
FLUSH_DELAYED_ITEMS = 2 ,
FLUSH_DELALLOC = 3 ,
FLUSH_DELALLOC_WAIT = 4 ,
ALLOC_CHUNK = 5 ,
COMMIT_TRANS = 6 ,
} ;
2017-02-20 14:50:36 +03:00
int btrfs_alloc_data_chunk_ondemand ( struct btrfs_inode * inode , u64 bytes ) ;
2017-02-27 10:10:38 +03:00
int btrfs_check_data_free_space ( struct inode * inode ,
struct extent_changeset * * reserved , u64 start , u64 len ) ;
btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
[BUG]
For the following case, btrfs can underflow qgroup reserved space
at an error path:
(Page size 4K, function name without "btrfs_" prefix)
         Task A                |             Task B
----------------------------------------------------------------------
 Buffered_write [0, 2K)        |
 |- check_data_free_space()    |
 |  |- qgroup_reserve_data()   |
 |     Range aligned to page   |
 |     range  [0, 4K)    <<<   |
 |     4K bytes reserved <<<   |
 |- copy pages to page cache   |
                               | Buffered_write [2K, 4K)
                               | |- check_data_free_space()
                               | |  |- qgroup_reserve_data()
                               | |     Range aligned to page
                               | |     range  [0, 4K)
                               | |     Already reserved by A <<<
                               | |     0 bytes reserved      <<<
                               | |- delalloc_reserve_metadata()
                               | |  And it *FAILED* (Maybe EQUOTA)
                               | |- free_reserved_data_space()
                               |    |- qgroup_free_data()
                               |       Range aligned to page range
                               |       [0, 4K)
                               |       Freeing 4K
(Special thanks to Chandan for the detailed report and analysis)
[CAUSE]
Above Task B is freeing reserved data range [0, 4K) which is actually
reserved by Task A.
And at writeback time, the page dirtied by Task A will go through the writeback
routine, which will free 4K reserved data space at file extent insert
time, causing the qgroup underflow.
[FIX]
For btrfs_qgroup_free_data(), add @reserved parameter to only free
data ranges reserved by previous btrfs_qgroup_reserve_data().
So in the above case, Task B will try to free 0 bytes, so no underflow.
Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
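The idea behind the @reserved parameter can be modelled at page granularity. In the sketch below, struct toy_inode, struct toy_changeset, toy_reserve() and toy_free() are hypothetical simplifications, not the real extent_changeset API: each reserve call records only the pages it newly reserved, and the error path frees only what its own changeset recorded.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE   4096u
#define TOY_NPAGES 8u

/* Hypothetical per-inode reservation state, one flag per page. */
struct toy_inode {
	uint8_t reserved[TOY_NPAGES];
	uint64_t qgroup_reserved;	/* total bytes currently reserved */
};

/* Hypothetical changeset: what one particular reserve call added. */
struct toy_changeset {
	uint8_t pages[TOY_NPAGES];
	uint64_t bytes;
};

/* Reserve [start, start + len), aligned out to page boundaries. */
void toy_reserve(struct toy_inode *in, struct toy_changeset *cs,
		 uint64_t start, uint64_t len)
{
	assert(len > 0);
	uint64_t first = start / TOY_PAGE;
	uint64_t last = (start + len - 1) / TOY_PAGE;
	assert(last < TOY_NPAGES);

	for (uint64_t p = first; p <= last; p++) {
		if (in->reserved[p])
			continue;	/* already reserved by another task */
		in->reserved[p] = 1;
		in->qgroup_reserved += TOY_PAGE;
		cs->pages[p] = 1;
		cs->bytes += TOY_PAGE;
	}
}

/* Free only the pages that this changeset itself reserved. */
void toy_free(struct toy_inode *in, const struct toy_changeset *cs)
{
	for (uint64_t p = 0; p < TOY_NPAGES; p++) {
		if (cs->pages[p] && in->reserved[p]) {
			in->reserved[p] = 0;
			in->qgroup_reserved -= TOY_PAGE;
		}
	}
}
```

Replaying the table above: Task A reserves [0, 2K) and its changeset records 4K; Task B reserves [2K, 4K) and its changeset records 0 bytes, so Task B's error path frees nothing and Task A's 4K stays accounted.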
2017-02-27 10:10:39 +03:00
void btrfs_free_reserved_data_space ( struct inode * inode ,
struct extent_changeset * reserved , u64 start , u64 len ) ;
void btrfs_delalloc_release_space ( struct inode * inode ,
struct extent_changeset * reserved , u64 start , u64 len ) ;
2015-10-08 13:19:37 +03:00
void btrfs_free_reserved_data_space_noquota ( struct inode * inode , u64 start ,
u64 len ) ;
Btrfs: fix -ENOSPC when finishing block group creation
While creating a block group, we often end up getting ENOSPC while updating
the chunk tree, which leads to a transaction abortion that produces a trace
like the following:
[30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
[30670.117777] BTRFS: Transaction aborted (error -28)
(...)
[30670.163567] Call Trace:
[30670.163906] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[30670.164522] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[30670.165171] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[30670.166323] [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
[30670.167213] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[30670.167862] [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
[30670.169116] [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
[30670.170593] [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs]
[30670.171960] [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs]
[30670.174649] [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
[30670.176092] [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs]
[30670.177218] [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15
[30670.178622] [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de
[30670.179642] [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
[30670.180692] [<ffffffff81152863>] SyS_fallocate+0x47/0x62
[30670.186737] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[30670.187792] ---[ end trace 0373e6b491c4a8cc ]---
This is because we don't do proper space reservation for the chunk block
reserve when we have multiple tasks allocating chunks in parallel.
So block group creation has 2 phases, and the first phase essentially
checks if there is enough space in the system space_info, allocating a
new system chunk if there isn't, while the second phase updates the
device, extent and chunk trees. However, because the updates to the
chunk tree happen in the second phase, if we have N tasks, each with
its own transaction handle, allocating new chunks in parallel and if
there is only enough space in the system space_info to allocate M chunks,
where M < N, none of the tasks ends up allocating a new system chunk in
the first phase and N - M tasks will get -ENOSPC when attempting to
update the chunk tree in phase 2 if they need to COW any nodes/leafs
from the chunk tree.
Fix this by doing proper reservation in the chunk block reserve.
The issue could be reproduced by running fstests generic/038 in a loop,
which eventually triggered the problem.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-20 16:01:54 +03:00
void btrfs_trans_release_chunk_metadata ( struct btrfs_trans_handle * trans ) ;
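The two-phase problem described above can be modelled in a few lines. This is only a sketch under stated assumptions: struct toy_space_info and the toy_ helpers are hypothetical, and a single byte counter stands in for the system space_info. It shows that a check that reserves nothing lets all N tasks pass phase 1, while a check that also reserves makes the shortage visible in phase 1, where a new system chunk can still be allocated.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the system space_info. */
struct toy_space_info {
	uint64_t total;		/* bytes available in total */
	uint64_t used;		/* bytes already consumed */
	uint64_t reserved;	/* bytes promised to in-flight tasks */
};

/* Phase 1 without reservation: looks at free space, promises nothing. */
static bool toy_check_only(const struct toy_space_info *s, uint64_t need)
{
	return s->total - s->used >= need;
}

/* Phase 1 with reservation: a successful check also claims the bytes. */
static bool toy_check_and_reserve(struct toy_space_info *s, uint64_t need)
{
	if (s->total - s->used - s->reserved < need)
		return false;	/* caller should allocate a new chunk now */
	s->reserved += need;
	return true;
}

/* Phase 2: consume bytes, from a reservation or from whatever is left. */
static bool toy_consume(struct toy_space_info *s, uint64_t need,
			bool had_reservation)
{
	if (had_reservation) {
		s->reserved -= need;
		s->used += need;
		return true;
	}
	if (s->total - s->used - s->reserved < need)
		return false;	/* models the late -ENOSPC */
	s->used += need;
	return true;
}
```

With room for M = 2 updates and N = 3 tasks, toy_check_only() succeeds three times and the third toy_consume() fails, whereas the third toy_check_and_reserve() already fails in phase 1.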
2010-05-16 18:49:58 +04:00
int btrfs_orphan_reserve_metadata ( struct btrfs_trans_handle * trans ,
2017-02-20 14:50:39 +03:00
struct btrfs_inode * inode ) ;
2017-02-20 14:50:40 +03:00
void btrfs_orphan_release_metadata ( struct btrfs_inode * inode ) ;
2013-02-28 14:04:33 +04:00
int btrfs_subvolume_reserve_metadata ( struct btrfs_root * root ,
struct btrfs_block_rsv * rsv ,
int nitems ,
2013-07-10 00:37:21 +04:00
u64 * qgroup_reserved , bool use_global_rsv ) ;
2016-06-23 01:54:24 +03:00
void btrfs_subvolume_release_metadata ( struct btrfs_fs_info * fs_info ,
2017-02-10 21:18:18 +03:00
struct btrfs_block_rsv * rsv ) ;
2017-10-19 21:15:55 +03:00
void btrfs_delalloc_release_extents ( struct btrfs_inode * inode , u64 num_bytes ) ;
2017-02-20 14:50:41 +03:00
int btrfs_delalloc_reserve_metadata ( struct btrfs_inode * inode , u64 num_bytes ) ;
2017-02-20 14:50:42 +03:00
void btrfs_delalloc_release_metadata ( struct btrfs_inode * inode , u64 num_bytes ) ;
2017-02-27 10:10:38 +03:00
int btrfs_delalloc_reserve_space ( struct inode * inode ,
struct extent_changeset * * reserved , u64 start , u64 len ) ;
2012-09-06 14:02:28 +04:00
void btrfs_init_block_rsv ( struct btrfs_block_rsv * rsv , unsigned short type ) ;
2016-06-23 01:54:24 +03:00
struct btrfs_block_rsv * btrfs_alloc_block_rsv ( struct btrfs_fs_info * fs_info ,
2012-09-06 14:02:28 +04:00
unsigned short type ) ;
2017-10-19 21:15:57 +03:00
void btrfs_init_metadata_block_rsv ( struct btrfs_fs_info * fs_info ,
struct btrfs_block_rsv * rsv ,
unsigned short type ) ;
2016-06-23 01:54:24 +03:00
void btrfs_free_block_rsv ( struct btrfs_fs_info * fs_info ,
2010-05-16 18:46:25 +04:00
struct btrfs_block_rsv * rsv ) ;
2015-04-07 04:17:00 +03:00
void __btrfs_free_block_rsv ( struct btrfs_block_rsv * rsv ) ;
2011-08-30 20:34:28 +04:00
int btrfs_block_rsv_add ( struct btrfs_root * root ,
Btrfs: improve the noflush reservation
In some places (such as when evicting an inode), we cannot flush the reserved
space of delalloc. Flushing the delayed directory index and delayed inode
is OK there, but we don't try to flush those things and just go back when there is
not enough space to be reserved. This patch fixes this problem.
We define 3 types of flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we are in a transaction, we must not flush anything, or a deadlock
would happen, so use NO_FLUSH. If flushing the reserved space of delalloc
would cause a deadlock, use FLUSH_LIMIT. In all other cases, FLUSH_ALL is used,
and we flush everything.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 15:33:38 +04:00
struct btrfs_block_rsv * block_rsv , u64 num_bytes ,
enum btrfs_reserve_flush_enum flush ) ;
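The selection rule from the commit message above can be written down as a small decision function. This is a sketch only: enum toy_flush_mode and toy_pick_flush() are hypothetical names, and the two boolean inputs are assumptions standing in for whatever context a real caller has.

```c
#include <stdbool.h>

/* Hypothetical mirror of the three flush types named in the commit message. */
enum toy_flush_mode {
	TOY_NO_FLUSH,		/* flush nothing */
	TOY_FLUSH_LIMIT,	/* flush, but not the delalloc reservation */
	TOY_FLUSH_ALL,		/* flush everything */
};

static enum toy_flush_mode toy_pick_flush(bool in_transaction,
					  bool delalloc_flush_may_deadlock)
{
	/* Inside a transaction any flushing could deadlock. */
	if (in_transaction)
		return TOY_NO_FLUSH;
	/* E.g. inode eviction: delayed items may be flushed, delalloc not. */
	if (delalloc_flush_may_deadlock)
		return TOY_FLUSH_LIMIT;
	return TOY_FLUSH_ALL;
}
```

The order of the checks matters: being inside a transaction takes priority over the delalloc restriction.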
2016-06-23 01:54:24 +03:00
int btrfs_block_rsv_check ( struct btrfs_block_rsv * block_rsv , int min_factor ) ;
2011-10-18 20:15:48 +04:00
int btrfs_block_rsv_refill ( struct btrfs_root * root ,
2012-10-16 15:33:38 +04:00
struct btrfs_block_rsv * block_rsv , u64 min_reserved ,
enum btrfs_reserve_flush_enum flush ) ;
2010-05-16 18:46:25 +04:00
int btrfs_block_rsv_migrate ( struct btrfs_block_rsv * src_rsv ,
2016-03-25 20:25:48 +03:00
struct btrfs_block_rsv * dst_rsv , u64 num_bytes ,
int update_size ) ;
2013-05-29 22:54:47 +04:00
int btrfs_cond_migrate_bytes ( struct btrfs_fs_info * fs_info ,
struct btrfs_block_rsv * dest , u64 num_bytes ,
int min_factor ) ;
2016-06-23 01:54:24 +03:00
void btrfs_block_rsv_release ( struct btrfs_fs_info * fs_info ,
2010-05-16 18:46:25 +04:00
struct btrfs_block_rsv * block_rsv ,
u64 num_bytes ) ;
2017-02-16 00:28:29 +03:00
int btrfs_inc_block_group_ro ( struct btrfs_fs_info * fs_info ,
2010-05-16 18:46:25 +04:00
struct btrfs_block_group_cache * cache ) ;
2016-06-23 01:54:24 +03:00
void btrfs_dec_block_group_ro ( struct btrfs_block_group_cache * cache ) ;
2010-06-21 22:48:16 +04:00
void btrfs_put_block_group_cache ( struct btrfs_fs_info * info ) ;
btrfs: fix wrong free space information of btrfs
When we store data with a RAID profile in btrfs on two or more disks of different
sizes, the df command shows some free space in the filesystem, but the
user cannot in fact write any data: df shows the wrong free space
information for btrfs.
# mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 28.00KB
devid 1 size 5.01GB used 2.03GB path /dev/sda9
devid 2 size 10.00GB used 2.01GB path /dev/sda10
# btrfs device scan /dev/sda9 /dev/sda10
# mount /dev/sda9 /mnt
# dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
(fill the filesystem)
# sync
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 3.99GB
devid 1 size 5.01GB used 5.01GB path /dev/sda9
devid 2 size 10.00GB used 4.99GB path /dev/sda10
This is because btrfs cannot allocate chunks when one of the paired disks has
no space left. The free space on the other disks can never be used and should
be subtracted from the total space, but btrfs doesn't subtract it from
the total, which is confusing to the user.
This patch fixes it by calculating the free space that can be used to allocate
chunks.
Implementation:
1. get the free space of all the devices, and align it by stripe length.
2. sort the devices by free space.
3. check the free space of each device,
3.1. if it is not zero, check the number of devices that have
more free space than this device,
if the number of devices is beyond the min stripe number, the free
space can be used, and is added into the total free space.
if the number of devices is below the min stripe number, we cannot
use the free space, and the check ends.
3.2. if the free space is zero, check the next device, goto 3.1
This implementation is essentially a simulated chunk allocation.
After applying this patch, df shows the correct space information:
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 0 100% /mnt
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-05 13:07:31 +03:00
u64 btrfs_account_ro_block_groups_free_space ( struct btrfs_space_info * sinfo ) ;
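The "simulated chunk allocation" described above can be sketched as follows. This is a simplified model, not the kernel's implementation: usable_free_space(), its parameters and the greedy pairing of the largest devices are assumptions, and the result counts mirrored (RAID1-like) capacity, so it may underestimate for some device layouts.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sort helper: largest free space first. */
static int cmp_desc(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x < y) - (x > y);
}

/*
 * Estimate how much free space can really be turned into chunks when every
 * chunk needs a stripe on min_stripes different devices.
 * Note: free_bytes[] is modified in place.
 */
uint64_t usable_free_space(uint64_t *free_bytes, size_t ndev,
			   size_t min_stripes, uint64_t stripe_len)
{
	uint64_t total = 0;
	size_t i;

	if (min_stripes == 0 || stripe_len == 0 || ndev < min_stripes)
		return 0;

	/* Step 1: align each device's free space down to the stripe length. */
	for (i = 0; i < ndev; i++)
		free_bytes[i] -= free_bytes[i] % stripe_len;

	for (;;) {
		uint64_t amount;

		/* Step 2: sort the devices by free space. */
		qsort(free_bytes, ndev, sizeof(*free_bytes), cmp_desc);

		/* Step 3: the min_stripes-th device limits this allocation. */
		amount = free_bytes[min_stripes - 1];
		if (amount == 0)
			break;	/* too few devices with space left */

		for (i = 0; i < min_stripes; i++)
			free_bytes[i] -= amount;
		total += amount;
	}
	return total;
}
```

For two devices with 5 GiB and 10 GiB free and min_stripes = 2, only 5 GiB is counted; the remaining 5 GiB on the larger device has no partner and is left out.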
2016-06-23 01:54:24 +03:00
int btrfs_error_unpin_extent_range ( struct btrfs_fs_info * fs_info ,
2011-01-06 14:30:25 +03:00
u64 start , u64 end ) ;
2016-06-23 01:54:24 +03:00
int btrfs_discard_extent ( struct btrfs_fs_info * fs_info , u64 bytenr ,
2014-12-08 17:01:12 +03:00
u64 num_bytes , u64 * actual_bytes ) ;
2011-02-16 21:57:04 +03:00
int btrfs_force_chunk_alloc ( struct btrfs_trans_handle * trans ,
2016-06-23 01:54:24 +03:00
struct btrfs_fs_info * fs_info , u64 type ) ;
int btrfs_trim_fs ( struct btrfs_fs_info * fs_info , struct fstrim_range * range ) ;
2011-01-06 14:30:25 +03:00
2011-03-07 05:13:14 +03:00
int btrfs_init_space_info ( struct btrfs_fs_info * fs_info ) ;
2012-06-28 20:03:02 +04:00
int btrfs_delayed_refs_qgroup_accounting ( struct btrfs_trans_handle * trans ,
struct btrfs_fs_info * fs_info ) ;
2012-11-21 18:18:10 +04:00
int __get_raid_index ( u64 flags ) ;
2017-06-22 03:19:11 +03:00
int btrfs_start_write_no_snapshotting ( struct btrfs_root * root ) ;
void btrfs_end_write_no_snapshotting ( struct btrfs_root * root ) ;
2016-01-06 13:56:36 +03:00
void btrfs_wait_for_snapshot_creation ( struct btrfs_root * root ) ;
Btrfs: fix -ENOSPC on block group removal
Unlike when attempting to allocate a new block group, where we check
that we have enough space in the system space_info to update the device
items and insert a new chunk item in the chunk tree, we were not checking
if the system space_info had enough space for updating the device items
and deleting the chunk item in the chunk tree. This often led to -ENOSPC
error when attempting to allocate blocks for the chunk tree (during btree
node/leaf COW operations) while updating the device items or deleting the
chunk item, which resulted in the current transaction being aborted and
turning the filesystem into read-only mode.
While running fstests generic/038, which stresses allocation of block
groups and removal of unused block groups, with a large scratch device
(750Gb) this happened often, despite more than enough unallocated space,
and resulted in the following trace:
[68663.586604] WARNING: CPU: 3 PID: 1521 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
[68663.600407] BTRFS: Transaction aborted (error -28)
(...)
[68663.730829] Call Trace:
[68663.732585] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[68663.734334] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[68663.739980] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[68663.757153] [<ffffffffa036ca6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[68663.760925] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[68663.762854] [<ffffffffa03b159d>] ? btrfs_update_device+0x15a/0x16c [btrfs]
[68663.764073] [<ffffffffa036ca6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[68663.765130] [<ffffffffa03b3638>] btrfs_remove_chunk+0x597/0x5ee [btrfs]
[68663.765998] [<ffffffffa0384663>] ? btrfs_delete_unused_bgs+0x245/0x296 [btrfs]
[68663.767068] [<ffffffffa0384676>] btrfs_delete_unused_bgs+0x258/0x296 [btrfs]
[68663.768227] [<ffffffff8143527f>] ? _raw_spin_unlock_irq+0x2d/0x4c
[68663.769081] [<ffffffffa038b109>] cleaner_kthread+0x13d/0x16c [btrfs]
[68663.799485] [<ffffffffa038afcc>] ? btrfs_alloc_root+0x28/0x28 [btrfs]
[68663.809208] [<ffffffff8105f367>] kthread+0xef/0xf7
[68663.828795] [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28
[68663.844942] [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
[68663.846486] [<ffffffff81435a88>] ret_from_fork+0x58/0x90
[68663.847760] [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
[68663.849503] ---[ end trace 798477c6d6dbaad6 ]---
[68663.850525] BTRFS: error (device sdc) in btrfs_remove_chunk:2652: errno=-28 No space left
So fix this by verifying that enough space exists in system space_info,
and reserving the space in the chunk block reserve, before attempting to
delete the block group and allocate a new system chunk if we don't have
enough space to perform the necessary updates and delete in the chunk
tree. Like for the block group creation case, we don't error out if we
fail to allocate a new system chunk, since we might end up not needing
it (no node/leaf splits happen during the COW operations and/or we end
up not needing to COW any btree nodes or leafs because they were already
COWed in the current transaction and their writeback didn't start yet).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-20 16:01:55 +03:00
void check_system_chunk ( struct btrfs_trans_handle * trans ,
2016-06-23 01:54:24 +03:00
struct btrfs_fs_info * fs_info , const u64 type ) ;
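A rough sketch of the "check, try to grow, then reserve" flow that the commit message above describes. All names here (struct toy_sys_space, toy_check_system_chunk(), the callback type and the two sample callbacks) are hypothetical; the point is only that a failed chunk allocation is tolerated rather than reported as an error.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the system space and the chunk block reserve. */
struct toy_sys_space {
	uint64_t free_bytes;	/* unreserved bytes in the system space */
	uint64_t reserved;	/* bytes held in the chunk block reserve */
};

/* Hypothetical hook that tries to add a new system chunk. */
typedef bool (*toy_alloc_chunk_fn)(struct toy_sys_space *s);

/* Returns the number of bytes reserved (0 if nothing could be reserved). */
static uint64_t toy_check_system_chunk(struct toy_sys_space *s,
				       uint64_t needed,
				       toy_alloc_chunk_fn alloc_chunk)
{
	/* Best effort: a failed allocation here is not treated as an error. */
	if (s->free_bytes < needed && alloc_chunk)
		(void)alloc_chunk(s);

	/* Still short: reserve nothing; the update may not need COW at all. */
	if (s->free_bytes < needed)
		return 0;

	s->free_bytes -= needed;
	s->reserved += needed;
	return needed;
}

/* Sample callbacks for experimenting with both outcomes. */
static bool toy_grow(struct toy_sys_space *s)
{
	s->free_bytes += 100;
	return true;
}

static bool toy_fail(struct toy_sys_space *s)
{
	(void)s;
	return false;
}
```

Calling it with toy_fail leaves the space untouched and returns 0, which the caller treats as "proceed and hope no COW is needed" rather than as a failure.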
2015-09-30 06:50:35 +03:00
u64 add_new_free_space ( struct btrfs_block_group_cache * block_group ,
struct btrfs_fs_info * info , u64 start , u64 end ) ;
2007-03-27 00:00:06 +04:00
/* ctree.c */
2017-01-18 10:24:37 +03:00
int btrfs_bin_search ( struct extent_buffer * eb , const struct btrfs_key * key ,
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 18:45:14 +04:00
int level , int * slot ) ;
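The reference counting rule from the commit message above (free a non-shared block at once, otherwise bump the children and drop one reference on the old block) can be illustrated with a toy model. struct toy_block and toy_cow() are hypothetical; real tree blocks, back refs and extent items are far more involved.

```c
#include <stdint.h>

#define TOY_MAX_CHILDREN 4

/* Hypothetical tree block with a reference count and child pointers. */
struct toy_block {
	uint32_t refs;
	int nr_children;
	struct toy_block *children[TOY_MAX_CHILDREN];
	int freed;
};

/* Copy-on-write 'old' into the caller-provided 'new_blk'. */
static void toy_cow(struct toy_block *old, struct toy_block *new_blk)
{
	int i;

	new_blk->refs = 1;
	new_blk->freed = 0;
	new_blk->nr_children = old->nr_children;
	for (i = 0; i < old->nr_children; i++)
		new_blk->children[i] = old->children[i];

	if (old->refs == 1) {
		/*
		 * Non-shared: free the old block at once. The copy inherits
		 * its references, so child counts stay untouched.
		 */
		old->refs = 0;
		old->freed = 1;
	} else {
		/*
		 * Shared: old and new both point at every child now, so each
		 * child gains one reference and the old block loses one.
		 */
		for (i = 0; i < new_blk->nr_children; i++)
			new_blk->children[i]->refs++;
		old->refs--;
	}
}
```

In the non-shared case no child counter is modified at all, which is the saving the commit message refers to.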
2017-01-18 10:24:37 +03:00
int btrfs_comp_cpu_keys ( const struct btrfs_key * k1 , const struct btrfs_key * k2 ) ;
2008-03-24 22:01:56 +03:00
int btrfs_previous_item ( struct btrfs_root * root ,
struct btrfs_path * path , u64 min_objectid ,
int type ) ;
2014-01-12 17:38:33 +04:00
int btrfs_previous_extent_item ( struct btrfs_root * root ,
struct btrfs_path * path , u64 min_objectid ) ;
2014-11-12 07:43:09 +03:00
void btrfs_set_item_key_safe ( struct btrfs_fs_info * fs_info ,
struct btrfs_path * path ,
2017-01-18 10:24:37 +03:00
const struct btrfs_key * new_key ) ;
2008-06-26 00:01:30 +04:00
struct extent_buffer * btrfs_root_node ( struct btrfs_root * root ) ;
struct extent_buffer * btrfs_lock_root_node ( struct btrfs_root * root ) ;
2017-09-29 22:43:49 +03:00
struct extent_buffer * btrfs_read_lock_root_node ( struct btrfs_root * root ) ;
2008-06-26 00:01:31 +04:00
int btrfs_find_next_key ( struct btrfs_root * root , struct btrfs_path * path ,
2008-06-26 00:01:31 +04:00
struct btrfs_key * key , int lowest_level ,
2013-01-31 22:21:12 +04:00
u64 min_trans ) ;
2008-06-26 00:01:31 +04:00
int btrfs_search_forward ( struct btrfs_root * root , struct btrfs_key * min_key ,
2013-01-31 22:21:12 +04:00
struct btrfs_path * path ,
2008-06-26 00:01:31 +04:00
u64 min_trans ) ;
2012-06-05 23:07:48 +04:00
enum btrfs_compare_tree_result {
BTRFS_COMPARE_TREE_NEW ,
BTRFS_COMPARE_TREE_DELETED ,
BTRFS_COMPARE_TREE_CHANGED ,
2013-08-17 00:52:55 +04:00
BTRFS_COMPARE_TREE_SAME ,
2012-06-05 23:07:48 +04:00
} ;
2017-08-21 12:43:45 +03:00
typedef int ( * btrfs_changed_cb_t ) ( struct btrfs_path * left_path ,
2012-06-05 23:07:48 +04:00
struct btrfs_path * right_path ,
struct btrfs_key * key ,
enum btrfs_compare_tree_result result ,
void * ctx ) ;
int btrfs_compare_trees ( struct btrfs_root * left_root ,
struct btrfs_root * right_root ,
btrfs_changed_cb_t cb , void * ctx ) ;
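A hedged example of what a callback matching btrfs_changed_cb_t might look like. The enum and typedef are repeated from the declarations above so the snippet stands alone, with the kernel structs left opaque; struct toy_diff_stats and toy_count_changes() are hypothetical, and treating a zero return as "continue" and a negative return as "abort" is an assumption.

```c
#include <stddef.h>

/* Opaque here; the real definitions live in the kernel headers. */
struct btrfs_path;
struct btrfs_key;

enum btrfs_compare_tree_result {
	BTRFS_COMPARE_TREE_NEW,
	BTRFS_COMPARE_TREE_DELETED,
	BTRFS_COMPARE_TREE_CHANGED,
	BTRFS_COMPARE_TREE_SAME,
};

typedef int (*btrfs_changed_cb_t)(struct btrfs_path *left_path,
				  struct btrfs_path *right_path,
				  struct btrfs_key *key,
				  enum btrfs_compare_tree_result result,
				  void *ctx);

/* Hypothetical context: one counter per result kind. */
struct toy_diff_stats {
	unsigned long counts[4];
};

/* Tally each reported difference; ctx points at a struct toy_diff_stats. */
static int toy_count_changes(struct btrfs_path *left_path,
			     struct btrfs_path *right_path,
			     struct btrfs_key *key,
			     enum btrfs_compare_tree_result result,
			     void *ctx)
{
	struct toy_diff_stats *stats = ctx;

	(void)left_path;
	(void)right_path;
	(void)key;

	if ((unsigned)result > BTRFS_COMPARE_TREE_SAME)
		return -1;	/* unknown result kind */
	stats->counts[result]++;
	return 0;		/* assumed to mean "keep going" */
}
```

Such a callback would be passed as cb, with a pointer to the stats struct as ctx.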
2007-10-16 00:14:19 +04:00
int btrfs_cow_block ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root , struct extent_buffer * buf ,
struct extent_buffer * parent , int parent_slot ,
2009-03-13 17:24:59 +03:00
struct extent_buffer * * cow_ret ) ;
2007-12-18 04:14:01 +03:00
int btrfs_copy_root ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct extent_buffer * buf ,
struct extent_buffer * * cow_ret , u64 new_root_objectid ) ;
2009-06-10 18:45:14 +04:00
int btrfs_block_can_be_shared ( struct btrfs_root * root ,
struct extent_buffer * buf ) ;
2016-06-23 01:54:24 +03:00
void btrfs_extend_item ( struct btrfs_fs_info * fs_info , struct btrfs_path * path ,
2012-03-01 17:56:26 +04:00
u32 data_size ) ;
2016-06-23 01:54:24 +03:00
void btrfs_truncate_item ( struct btrfs_fs_info * fs_info ,
struct btrfs_path * path , u32 new_size , int from_end ) ;
2008-12-10 17:10:46 +03:00
int btrfs_split_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path ,
2017-01-18 10:24:37 +03:00
const struct btrfs_key * new_key ,
2008-12-10 17:10:46 +03:00
unsigned long split_offset ) ;
2009-11-12 12:33:58 +03:00
int btrfs_duplicate_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path ,
2017-01-18 10:24:37 +03:00
const struct btrfs_key * new_key ) ;
2013-11-05 07:33:33 +04:00
int btrfs_find_item ( struct btrfs_root * fs_root , struct btrfs_path * path ,
u64 inum , u64 ioff , u8 key_type , struct btrfs_key * found_key ) ;
2017-01-18 10:24:37 +03:00
int btrfs_search_slot ( struct btrfs_trans_handle * trans , struct btrfs_root * root ,
const struct btrfs_key * key , struct btrfs_path * p ,
int ins_len , int cow ) ;
int btrfs_search_old_slot ( struct btrfs_root * root , const struct btrfs_key * key ,
2012-05-16 20:25:47 +04:00
struct btrfs_path * p , u64 time_seq ) ;
2011-09-13 13:18:10 +04:00
int btrfs_search_slot_for_read ( struct btrfs_root * root ,
2017-01-18 10:24:37 +03:00
const struct btrfs_key * key ,
struct btrfs_path * p , int find_higher ,
int return_any ) ;
2007-08-08 00:15:09 +04:00
int btrfs_realloc_node ( struct btrfs_trans_handle * trans ,
2007-10-16 00:14:19 +04:00
struct btrfs_root * root , struct extent_buffer * parent ,
2013-01-31 22:21:12 +04:00
int start_slot , u64 * last_ret ,
2007-10-16 00:22:39 +04:00
struct btrfs_key * progress ) ;
2011-04-21 03:20:15 +04:00
void btrfs_release_path ( struct btrfs_path * p ) ;
2007-04-02 18:50:19 +04:00
struct btrfs_path * btrfs_alloc_path ( void ) ;
void btrfs_free_path ( struct btrfs_path * p ) ;
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop.
Most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able to get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 17:25:08 +03:00
void btrfs_set_path_blocking ( struct btrfs_path * p ) ;
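The lock states described in the commit message above can be followed with a single-threaded state model. This is a sketch of the state transitions only: struct toy_eb_lock and the toy_ functions are hypothetical, two booleans stand in for the spinlock and the EXTENT_BUFFER_BLOCKING bit, and no real waiting, spinning or concurrency is modelled.

```c
#include <stdbool.h>

/* Hypothetical model of one extent buffer's lock state. */
struct toy_eb_lock {
	bool spin_held;	/* stand-in for the spinlock */
	bool blocking;	/* stand-in for the EXTENT_BUFFER_BLOCKING bit */
};

/* The buffer counts as locked in either state. */
static bool toy_is_locked(const struct toy_eb_lock *l)
{
	return l->spin_held || l->blocking;
}

/* Try to lock; false means a real caller would have to spin or wait. */
static bool toy_tree_trylock(struct toy_eb_lock *l)
{
	if (toy_is_locked(l))
		return false;
	l->spin_held = true;
	return true;
}

/* About to schedule: set the blocking bit, then drop the spinlock. */
static void toy_set_lock_blocking(struct toy_eb_lock *l)
{
	l->blocking = true;
	l->spin_held = false;
}

/* Done scheduling: return to holding the spinlock. */
static void toy_clear_lock_blocking(struct toy_eb_lock *l)
{
	l->spin_held = true;
	l->blocking = false;
}

/* Unlock works from either state, based on the blocking bit. */
static void toy_tree_unlock(struct toy_eb_lock *l)
{
	if (l->blocking)
		l->blocking = false;
	else
		l->spin_held = false;
}
```

Between toy_set_lock_blocking() and toy_clear_lock_blocking() the spinlock stand-in is free, yet toy_is_locked() still reports the buffer as locked.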
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comments.
- Fix the CPU recursive spinlock bug reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so we made this patch, which implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, which uses two lists to
manage the delayed nodes that are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. The
other is used to manage the delayed nodes that are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees: one is used to manage the directory name
indexes that are going to be inserted into the b+ tree, and the other is used to
manage the directory name indexes that are going to be deleted from the b+ tree.
- introduce a worker to deal with the delayed operations. This worker is used
to deal with the delayed directory name index item insertions
and deletions and the delayed inode updates.
When the number of delayed items is beyond the lower limit, we create works for some
delayed nodes and insert them into the work queue of the worker, and then
go back.
When the number of delayed items is beyond the upper bound, we create works for all
the delayed nodes that haven't been dealt with, insert them into the work
queue of the worker, and then wait until the number of untreated items is below some
threshold value.
- When we want to insert a directory name index into the b+ tree, we just add the
information into the delayed inserting rb-tree.
Then we check the number of delayed items and do the delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search for it
in the inserting rb-tree first. If we find it, we just drop it. If not, we
add its key into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of
delayed items and do the delayed items balance.
(The same as for insertion)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertions and deletions.
- We move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit the transaction, we deal with all the delayed nodes.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 14:12:22 +04:00
void btrfs_clear_path_blocking ( struct btrfs_path * p ,
2011-07-16 23:23:14 +04:00
struct extent_buffer * held , int held_rw ) ;
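The insert/delete cancellation rule from the delayed inode commit message above can be sketched with two small arrays. The arrays stand in for the two rb-trees, and struct toy_delayed_node, toy_delayed_insert() and toy_delayed_delete() are hypothetical names; the balance policy and the worker are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ITEMS 16

/* Hypothetical delayed node: pending insertions and pending deletions. */
struct toy_delayed_node {
	uint64_t ins[TOY_MAX_ITEMS];
	size_t nr_ins;
	uint64_t del[TOY_MAX_ITEMS];
	size_t nr_del;
};

/* Queue a directory index for later insertion into the b+ tree. */
static bool toy_delayed_insert(struct toy_delayed_node *n, uint64_t index)
{
	if (n->nr_ins == TOY_MAX_ITEMS)
		return false;
	n->ins[n->nr_ins++] = index;
	return true;
}

/*
 * Delete a directory index. If it is still waiting to be inserted, the two
 * operations cancel out and the b+ tree is never touched; otherwise the
 * deletion is queued.
 */
static bool toy_delayed_delete(struct toy_delayed_node *n, uint64_t index)
{
	for (size_t i = 0; i < n->nr_ins; i++) {
		if (n->ins[i] == index) {
			/* Drop the pending insert by moving the last one here. */
			n->ins[i] = n->ins[--n->nr_ins];
			return true;
		}
	}
	if (n->nr_del == TOY_MAX_ITEMS)
		return false;
	n->del[n->nr_del++] = index;
	return true;
}
```

A file that is created and removed before the worker runs therefore costs no b+ tree operation for its directory index.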
2009-02-04 17:25:08 +03:00
void btrfs_unlock_up_safe ( struct btrfs_path * p , int level ) ;
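The spin/blocking protocol described in the commit message above can be modelled in user space. This is a minimal sketch, not the kernel implementation: the `model_*` names are hypothetical, a C11 `atomic_flag` stands in for the spinlock, an `atomic_bool` stands in for the EXTENT_BUFFER_BLOCKING bit, and the wait queue is replaced by plain spinning on the blocking flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the lock state of one extent buffer. */
struct model_eb_lock {
	atomic_flag spin;      /* models the spinlock */
	atomic_bool blocking;  /* models the EXTENT_BUFFER_BLOCKING bit */
};

static void model_lock_init(struct model_eb_lock *l)
{
	atomic_flag_clear(&l->spin);
	atomic_init(&l->blocking, false);
}

/* Returns with the spinlock held and the blocking flag clear. */
static void model_tree_lock(struct model_eb_lock *l)
{
	for (;;) {
		while (atomic_flag_test_and_set(&l->spin))
			;	/* spin for the lock */
		if (!atomic_load(&l->blocking))
			return;
		/* A blocking holder exists: drop the spinlock and wait. */
		atomic_flag_clear(&l->spin);
		while (atomic_load(&l->blocking))
			;	/* the kernel would sleep on a wait queue here */
	}
}

/* Caller holds the spinlock; afterwards the buffer is still owned. */
static void model_set_lock_blocking(struct model_eb_lock *l)
{
	assert(!atomic_load(&l->blocking));
	atomic_store(&l->blocking, true);
	atomic_flag_clear(&l->spin);
}

/* Caller owns the buffer in blocking mode; returns with the spinlock held. */
static void model_clear_lock_blocking(struct model_eb_lock *l)
{
	while (atomic_flag_test_and_set(&l->spin))
		;
	atomic_store(&l->blocking, false);
}

/* Works for both spinning and blocking owners, keyed on the flag. */
static void model_tree_unlock(struct model_eb_lock *l)
{
	if (atomic_load(&l->blocking))
		atomic_store(&l->blocking, false);
	else
		atomic_flag_clear(&l->spin);
}
```

Only the owner ever changes the blocking flag, which is why the unlock path can branch on it without taking the spinlock first.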
2008-01-29 23:11:36 +03:00
int btrfs_del_items ( struct btrfs_trans_handle * trans , struct btrfs_root * root ,
struct btrfs_path * path , int slot , int nr ) ;
static inline int btrfs_del_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path )
{
return btrfs_del_items ( trans , root , path , path - > slots [ 0 ] , 1 ) ;
}
2013-04-16 09:18:22 +04:00
void setup_items_for_insert ( struct btrfs_root * root , struct btrfs_path * path ,
2017-01-18 10:24:37 +03:00
const struct btrfs_key * cpu_key , u32 * data_size ,
2012-03-01 17:56:26 +04:00
u32 total_data , u32 total_size , int nr ) ;
2017-01-18 10:24:37 +03:00
int btrfs_insert_item ( struct btrfs_trans_handle * trans , struct btrfs_root * root ,
const struct btrfs_key * key , void * data , u32 data_size ) ;
2008-01-29 23:15:18 +03:00
int btrfs_insert_empty_items ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path ,
2017-01-18 10:24:37 +03:00
const struct btrfs_key * cpu_key , u32 * data_size ,
int nr ) ;
2008-01-29 23:15:18 +03:00
static inline int btrfs_insert_empty_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path ,
2017-01-18 10:24:37 +03:00
const struct btrfs_key * key ,
2008-01-29 23:15:18 +03:00
u32 data_size )
{
return btrfs_insert_empty_items ( trans , root , path , key , & data_size , 1 ) ;
}
2007-03-13 17:46:10 +03:00
int btrfs_next_leaf ( struct btrfs_root * root , struct btrfs_path * path ) ;
2013-10-22 20:18:51 +04:00
int btrfs_prev_leaf ( struct btrfs_root * root , struct btrfs_path * path ) ;
2012-06-11 10:29:29 +04:00
int btrfs_next_old_leaf ( struct btrfs_root * root , struct btrfs_path * path ,
u64 time_seq ) ;
2012-06-19 17:42:25 +04:00
static inline int btrfs_next_old_item ( struct btrfs_root * root ,
struct btrfs_path * p , u64 time_seq )
2011-11-22 18:14:33 +04:00
{
+ + p - > slots [ 0 ] ;
if ( p - > slots [ 0 ] > = btrfs_header_nritems ( p - > nodes [ 0 ] ) )
2012-06-19 17:42:25 +04:00
return btrfs_next_old_leaf ( root , p , time_seq ) ;
2011-11-22 18:14:33 +04:00
return 0 ;
}
2012-06-19 17:42:25 +04:00
static inline int btrfs_next_item ( struct btrfs_root * root , struct btrfs_path * p )
{
return btrfs_next_old_item ( root , p , 0 ) ;
}
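The slot-advance pattern used by btrfs_next_old_item() and btrfs_next_item() can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `model_leaf`, `model_path` and both functions are hypothetical, leaves are a flat array rather than a B-tree, and the "1 means no more leaves" return value is an assumption about the convention rather than something shown in this header.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: a leaf only knows how many items it holds. */
struct model_leaf {
	int nritems;
};

struct model_path {
	const struct model_leaf *leaves;
	size_t nr_leaves;
	size_t leaf;	/* index of the current leaf */
	int slot;	/* index of the current item inside that leaf */
};

/* Move to the first slot of the next leaf; 1 means there is none. */
static int model_next_leaf(struct model_path *p)
{
	if (p->leaf + 1 >= p->nr_leaves)
		return 1;
	p->leaf++;
	p->slot = 0;
	return 0;
}

/* Same shape as the inline helper: bump the slot, roll over if needed. */
static int model_next_item(struct model_path *p)
{
	assert(p->leaf < p->nr_leaves);
	++p->slot;
	if (p->slot >= p->leaves[p->leaf].nritems)
		return model_next_leaf(p);
	return 0;
}
```

A caller would typically loop on `model_next_item()` until it returns non-zero, reading the item at the current leaf and slot on each iteration.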
2016-06-23 01:54:24 +03:00
int btrfs_leaf_free_space ( struct btrfs_fs_info * fs_info ,
struct extent_buffer * leaf ) ;
2011-10-04 07:22:41 +04:00
int __must_check btrfs_drop_snapshot ( struct btrfs_root * root ,
struct btrfs_block_rsv * block_rsv ,
int update_ref , int for_reloc ) ;
2008-10-29 21:49:05 +03:00
int btrfs_drop_subtree ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct extent_buffer * node ,
struct extent_buffer * parent ) ;
2011-05-31 20:07:27 +04:00
static inline int btrfs_fs_closing ( struct btrfs_fs_info * fs_info )
{
/*
2016-09-02 22:40:02 +03:00
* Do it this way so we only ever do one test_bit in the normal case .
2011-05-31 20:07:27 +04:00
*/
2016-09-02 22:40:02 +03:00
if ( test_bit ( BTRFS_FS_CLOSING_START , & fs_info - > flags ) ) {
if ( test_bit ( BTRFS_FS_CLOSING_DONE , & fs_info - > flags ) )
return 2 ;
return 1 ;
}
return 0 ;
2011-05-31 20:07:27 +04:00
}
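The nested test in btrfs_fs_closing() yields three states (0: not closing, 1: closing started, 2: closing done) while testing only one bit in the common case. A user-space sketch of the same shape, with hypothetical `MODEL_*` bit numbers and a plain `unsigned long` in place of `fs_info->flags`:

```c
#include <stdbool.h>

/* Hypothetical bit numbers; the real values are defined elsewhere. */
enum {
	MODEL_FS_CLOSING_START = 0,
	MODEL_FS_CLOSING_DONE = 1,
};

/* Non-atomic stand-in for the kernel's test_bit(). */
static bool model_test_bit(int nr, unsigned long flags)
{
	return (flags >> nr) & 1UL;
}

/* The DONE bit is only inspected once START is known to be set. */
static int model_fs_closing(unsigned long flags)
{
	if (model_test_bit(MODEL_FS_CLOSING_START, flags)) {
		if (model_test_bit(MODEL_FS_CLOSING_DONE, flags))
			return 2;
		return 1;
	}
	return 0;
}
```

Note that a DONE bit without a START bit reports 0 in this model, because the second bit is never examined on that path.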
2013-05-14 14:20:43 +04:00
/*
* If we remount the fs to be R / O or umount the fs , the cleaner needn ' t do
* anything except sleeping . This function is used to check the status of
* the fs .
*/
2016-06-23 01:54:24 +03:00
static inline int btrfs_need_cleaner_sleep ( struct btrfs_fs_info * fs_info )
2013-05-14 14:20:43 +04:00
{
2017-11-28 00:05:09 +03:00
return fs_info - > sb - > s_flags & SB_RDONLY | | btrfs_fs_closing ( fs_info ) ;
2013-05-14 14:20:43 +04:00
}
2011-04-13 17:41:04 +04:00
static inline void free_fs_info ( struct btrfs_fs_info * fs_info )
{
2012-01-17 00:04:49 +04:00
kfree ( fs_info - > balance_ctl ) ;
2011-04-13 17:41:04 +04:00
kfree ( fs_info - > delayed_root ) ;
kfree ( fs_info - > extent_root ) ;
kfree ( fs_info - > tree_root ) ;
kfree ( fs_info - > chunk_root ) ;
kfree ( fs_info - > dev_root ) ;
kfree ( fs_info - > csum_root ) ;
2011-09-13 17:23:30 +04:00
kfree ( fs_info - > quota_root ) ;
2013-08-24 22:51:06 +04:00
kfree ( fs_info - > uuid_root ) ;
2015-09-30 06:50:38 +03:00
kfree ( fs_info - > free_space_root ) ;
2011-04-13 17:41:04 +04:00
kfree ( fs_info - > super_copy ) ;
kfree ( fs_info - > super_for_commit ) ;
2014-09-23 09:40:08 +04:00
security_free_mnt_opts ( & fs_info - > security_opts ) ;
2018-02-16 06:59:47 +03:00
kvfree ( fs_info ) ;
2011-04-13 17:41:04 +04:00
}
2011-05-31 20:07:27 +04:00
2012-06-21 13:08:04 +04:00
/* tree mod log functions from ctree.c */
u64 btrfs_get_tree_mod_seq ( struct btrfs_fs_info * fs_info ,
struct seq_list * elem ) ;
void btrfs_put_tree_mod_seq ( struct btrfs_fs_info * fs_info ,
struct seq_list * elem ) ;
2012-10-23 13:28:27 +04:00
int btrfs_old_root_level ( struct btrfs_root * root , u64 time_seq ) ;
2012-06-21 13:08:04 +04:00
2007-03-27 00:00:06 +04:00
/* root-item.c */
2008-11-18 04:37:39 +03:00
int btrfs_add_root_ref ( struct btrfs_trans_handle * trans ,
2016-06-22 04:16:51 +03:00
struct btrfs_fs_info * fs_info ,
2009-09-21 23:56:00 +04:00
u64 root_id , u64 ref_id , u64 dirid , u64 sequence ,
const char * name , int name_len ) ;
int btrfs_del_root_ref ( struct btrfs_trans_handle * trans ,
2016-06-22 04:16:51 +03:00
struct btrfs_fs_info * fs_info ,
2009-09-21 23:56:00 +04:00
u64 root_id , u64 ref_id , u64 dirid , u64 * sequence ,
2008-11-18 04:37:39 +03:00
const char * name , int name_len ) ;
2017-08-17 17:25:11 +03:00
int btrfs_del_root ( struct btrfs_trans_handle * trans ,
struct btrfs_fs_info * fs_info , const struct btrfs_key * key ) ;
2017-01-18 10:24:37 +03:00
int btrfs_insert_root ( struct btrfs_trans_handle * trans , struct btrfs_root * root ,
const struct btrfs_key * key ,
struct btrfs_root_item * item ) ;
2011-10-04 07:22:44 +04:00
int __must_check btrfs_update_root ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_key * key ,
struct btrfs_root_item * item ) ;
2017-01-18 10:24:37 +03:00
int btrfs_find_root ( struct btrfs_root * root , const struct btrfs_key * search_key ,
2013-05-15 11:48:19 +04:00
struct btrfs_path * path , struct btrfs_root_item * root_item ,
struct btrfs_key * root_key ) ;
2016-06-22 04:16:51 +03:00
int btrfs_find_orphan_roots ( struct btrfs_fs_info * fs_info ) ;
2011-07-15 01:23:06 +04:00
void btrfs_set_root_node ( struct btrfs_root_item * item ,
struct extent_buffer * node ) ;
2011-03-28 06:01:25 +04:00
void btrfs_check_and_init_root_item ( struct btrfs_root_item * item ) ;
2012-07-25 19:35:53 +04:00
void btrfs_update_root_times ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ) ;
2011-03-28 06:01:25 +04:00
Btrfs: introduce a tree for items that map UUIDs to something
Mapping UUIDs to subvolume IDs is an operation with a high effort
today. Today, the algorithm even has quadratic effort (based on the
number of existing subvolumes), which means, that it takes minutes
to send/receive a single subvolume if 10,000 subvolumes exist. But
even linear effort would be too much since it is a waste. And these
data structures to allow mapping UUIDs to subvolume IDs are created
every time a btrfs send/receive instance is started.
It is much more efficient to maintain a searchable persistent data
structure in the filesystem, one that is updated whenever a
subvolume/snapshot is created and deleted, and when the received
subvolume UUID is set by the btrfs-receive tool.
Therefore kernel code is added with this commit that is able to
maintain data structures in the filesystem that allow to quickly
search for a given UUID and to retrieve data that is assigned to
this UUID, like which subvolume ID is related to this UUID.
This commit adds a new tree to hold UUID-to-data mapping items. The
key of the items is the full UUID plus the key type BTRFS_UUID_KEY.
Multiple data blocks can be stored for a given UUID, a type/length/
value scheme is used.
Now follows the lengthy justification, why a new tree was added
instead of using the existing root tree:
The first approach was to not create another tree that holds UUID
items. Instead, the items should just go into the top root tree.
Unfortunately this confused the algorithm to assign the objectid
of subvolumes and snapshots. The reason is that
btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for
the first created subvol or snapshot after mounting a filesystem,
and this function simply searches for the largest used objectid in
the root tree keys to pick the next objectid to assign. Of course,
the UUID keys have always been the ones with the highest offset
value, and the next assigned subvol ID was wastefully huge.
To use any other existing tree did not look proper. To apply a
workaround such as setting the objectid to zero in the UUID item
key and to implement collision handling would either add
limitations (in case of a btrfs_extend_item() approach to handle
the collisions) or a lot of complexity and source code (in case a
key would be looked up that is free of collisions). Adding new code
that introduces limitations is not good, and adding code that is
complex and lengthy for no good reason is also not good. That's the
justification why a completely new tree was introduced.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-15 19:11:17 +04:00
/* uuid-tree.c */
int btrfs_uuid_tree_add ( struct btrfs_trans_handle * trans ,
2016-06-22 04:16:51 +03:00
struct btrfs_fs_info * fs_info , u8 * uuid , u8 type ,
2013-08-15 19:11:17 +04:00
u64 subid ) ;
int btrfs_uuid_tree_rem ( struct btrfs_trans_handle * trans ,
2016-06-22 04:16:51 +03:00
struct btrfs_fs_info * fs_info , u8 * uuid , u8 type ,
2013-08-15 19:11:17 +04:00
u64 subid ) ;
2013-08-15 19:11:23 +04:00
int btrfs_uuid_tree_iterate ( struct btrfs_fs_info * fs_info ,
int ( * check_func ) ( struct btrfs_fs_info * , u8 * , u8 ,
u64 ) ) ;
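The UUID tree commit message says the item key is the full UUID plus the key type. One way a 16-byte UUID could be folded into a three-field key is sketched below. This is an assumption for illustration only: the excerpt does not show the actual packing, and `model_key`, `model_load_le64()` and `model_uuid_to_key()` are hypothetical names. The sketch splits the UUID into two little-endian 64-bit halves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory key with the usual objectid/type/offset triple. */
struct model_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Portable little-endian load of 8 bytes, independent of host order. */
static uint64_t model_load_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Assumed packing: first half in objectid, second half in offset. */
static void model_uuid_to_key(const uint8_t *uuid, uint8_t type,
			      struct model_key *key)
{
	assert(uuid != NULL && key != NULL);
	key->objectid = model_load_le64(uuid);
	key->type = type;
	key->offset = model_load_le64(uuid + 8);
}
```

With such a packing, every bit of the UUID takes part in the key, so a lookup by UUID is a single tree search rather than a scan over all subvolumes.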
2013-08-15 19:11:17 +04:00
2007-03-27 00:00:06 +04:00
/* dir-item.c */
2012-12-17 23:26:57 +04:00
int btrfs_check_dir_item_collision ( struct btrfs_root * root , u64 dir ,
const char * name , int name_len ) ;
2009-01-06 05:25:51 +03:00
int btrfs_insert_dir_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root , const char * name ,
2017-02-20 14:50:31 +03:00
int name_len , struct btrfs_inode * dir ,
2008-07-24 20:12:38 +04:00
struct btrfs_key * location , u8 type , u64 index ) ;
2007-04-19 23:36:27 +04:00
struct btrfs_dir_item * btrfs_lookup_dir_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path , u64 dir ,
const char * name , int name_len ,
int mod ) ;
struct btrfs_dir_item *
btrfs_lookup_dir_index_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path , u64 dir ,
u64 objectid , const char * name , int name_len ,
int mod ) ;
2009-09-21 23:56:00 +04:00
struct btrfs_dir_item *
btrfs_search_dir_index_item ( struct btrfs_root * root ,
struct btrfs_path * path , u64 dirid ,
const char * name , int name_len ) ;
2007-04-19 23:36:27 +04:00
int btrfs_delete_one_dir_name ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path ,
struct btrfs_dir_item * di ) ;
2007-11-16 19:45:54 +03:00
int btrfs_insert_xattr_item ( struct btrfs_trans_handle * trans ,
2009-11-12 12:35:27 +03:00
struct btrfs_root * root ,
struct btrfs_path * path , u64 objectid ,
const char * name , u16 name_len ,
const void * data , u16 data_len ) ;
2007-11-16 19:45:54 +03:00
struct btrfs_dir_item * btrfs_lookup_xattr ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path , u64 dir ,
const char * name , u16 name_len ,
int mod ) ;
2016-06-23 01:54:24 +03:00
struct btrfs_dir_item * btrfs_match_dir_item_name ( struct btrfs_fs_info * fs_info ,
2014-11-09 11:38:39 +03:00
struct btrfs_path * path ,
const char * name ,
int name_len ) ;
2008-07-24 20:17:14 +04:00
/* orphan.c */
int btrfs_insert_orphan_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root , u64 offset ) ;
int btrfs_del_orphan_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root , u64 offset ) ;
2009-09-21 23:56:00 +04:00
int btrfs_find_orphan_item ( struct btrfs_root * root , u64 offset ) ;
2008-07-24 20:17:14 +04:00
2007-03-27 00:00:06 +04:00
/* inode-item.c */
2007-12-12 22:38:19 +03:00
int btrfs_insert_inode_ref ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
const char * name , int name_len ,
2008-07-24 20:12:38 +04:00
u64 inode_objectid , u64 ref_objectid , u64 index ) ;
2007-12-12 22:38:19 +03:00
int btrfs_del_inode_ref ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
const char * name , int name_len ,
2008-07-24 20:12:38 +04:00
u64 inode_objectid , u64 ref_objectid , u64 * index ) ;
2007-10-16 00:14:19 +04:00
int btrfs_insert_empty_inode ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path , u64 objectid ) ;
2007-03-20 22:57:25 +03:00
int btrfs_lookup_inode ( struct btrfs_trans_handle * trans , struct btrfs_root
2007-04-06 23:37:36 +04:00
* root , struct btrfs_path * path ,
struct btrfs_key * location , int mod ) ;
2007-03-27 00:00:06 +04:00
2012-08-08 22:32:27 +04:00
struct btrfs_inode_extref *
btrfs_lookup_inode_extref ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path ,
const char * name , int name_len ,
u64 inode_objectid , u64 ref_objectid , int ins_len ,
int cow ) ;
Btrfs: fix log replay failure after unlink and link combination
If we have a file with 2 (or more) hard links in the same directory,
remove one of the hard links, create a new file (or link an existing file)
in the same directory with the name of the removed hard link, and then
finally fsync the new file, we end up with a log that fails to replay,
causing a mount failure.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/testdir
$ touch /mnt/testdir/foo
$ ln /mnt/testdir/foo /mnt/testdir/bar
$ sync
$ unlink /mnt/testdir/bar
$ touch /mnt/testdir/bar
$ xfs_io -c "fsync" /mnt/testdir/bar
<power failure>
$ mount /dev/sdb /mnt
mount: mount(2) failed: /mnt: No such file or directory
When replaying the log, for that example, we also see the following in
dmesg/syslog:
[71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257
[71813.674204] ------------[ cut here ]------------
[71813.675694] BTRFS: Transaction aborted (error -2)
[71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs]
[71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs]
[71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G W 4.15.0-rc9-btrfs-next-56+ #1
[71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs]
[71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286
[71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001
[71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff
[71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001
[71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8
[71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101
[71813.679669] FS: 00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
[71813.679669] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0
[71813.679669] Call Trace:
[71813.679669] btrfs_unlink_inode+0x17/0x41 [btrfs]
[71813.679669] drop_one_dir_item+0xfa/0x131 [btrfs]
[71813.679669] add_inode_ref+0x71e/0x851 [btrfs]
[71813.679669] ? __lock_is_held+0x39/0x71
[71813.679669] ? replay_one_buffer+0x53/0x53a [btrfs]
[71813.679669] replay_one_buffer+0x4a4/0x53a [btrfs]
[71813.679669] ? rcu_read_unlock+0x3a/0x57
[71813.679669] ? __lock_is_held+0x39/0x71
[71813.679669] walk_up_log_tree+0x101/0x1d2 [btrfs]
[71813.679669] walk_log_tree+0xad/0x188 [btrfs]
[71813.679669] btrfs_recover_log_trees+0x1fa/0x31e [btrfs]
[71813.679669] ? replay_one_extent+0x544/0x544 [btrfs]
[71813.679669] open_ctree+0x1cf6/0x2209 [btrfs]
[71813.679669] btrfs_mount_root+0x368/0x482 [btrfs]
[71813.679669] ? trace_hardirqs_on_caller+0x14c/0x1a6
[71813.679669] ? __lockdep_init_map+0x176/0x1c2
[71813.679669] ? mount_fs+0x64/0x10b
[71813.679669] mount_fs+0x64/0x10b
[71813.679669] vfs_kern_mount+0x68/0xce
[71813.679669] btrfs_mount+0x13e/0x772 [btrfs]
[71813.679669] ? trace_hardirqs_on_caller+0x14c/0x1a6
[71813.679669] ? __lockdep_init_map+0x176/0x1c2
[71813.679669] ? mount_fs+0x64/0x10b
[71813.679669] mount_fs+0x64/0x10b
[71813.679669] vfs_kern_mount+0x68/0xce
[71813.679669] do_mount+0x6e5/0x973
[71813.679669] ? memdup_user+0x3e/0x5c
[71813.679669] SyS_mount+0x72/0x98
[71813.679669] entry_SYSCALL_64_fastpath+0x1e/0x8b
[71813.679669] RIP: 0033:0x7f7cedf150ba
[71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206
[71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c
[71813.679669] ---[ end trace 83bd473fc5b4663b ]---
[71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry
[71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree)
[71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30
[71814.128078] BTRFS error (device dm-0): open_ctree failed
This happens because the log has inode reference items for both inode 258
(the first file we created) and inode 259 (the second file created), and
when processing the reference item for inode 258, we replace the
corresponding item in the subvolume tree (which has two names, "foo" and
"bar") with the one in the log (which only has one name, "foo") without
removing the corresponding dir index keys from the parent directory.
Later, when processing the inode reference item for inode 259, which has
a name of "bar" associated to it, we notice that dir index entries exist
for that name and for a different inode, so we attempt to unlink that
name, which fails because the inode reference item for inode 258 no longer
has the name "bar" associated to it, making a call to btrfs_unlink_inode()
fail with a -ENOENT error.
Fix this by unlinking all the names in an inode reference item from a
subvolume tree that are not present in the inode reference item found in
the log tree, before overwriting it with the item from the log tree.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-02-28 18:56:10 +03:00
int btrfs_find_name_in_backref ( struct extent_buffer * leaf , int slot ,
const char * name ,
int name_len , struct btrfs_inode_ref * * ref_ret ) ;
int btrfs_find_name_in_ext_backref ( struct extent_buffer * leaf , int slot ,
2012-08-08 22:32:27 +04:00
u64 ref_objectid , const char * name ,
int name_len ,
struct btrfs_inode_extref * * extref_ret ) ;
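The log replay fix described above amounts to a set difference: every name in the subvolume tree's inode reference item that is absent from the log's item must be unlinked before the item is overwritten. A user-space sketch of that selection step, with hypothetical `model_*` names and plain string arrays in place of on-disk reference items:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an inode reference item: a list of names. */
struct model_ref_item {
	const char *const *names;
	size_t nr;
};

/* Returns 1 if the name is present in the item, 0 otherwise. */
static int model_find_name(const struct model_ref_item *item,
			   const char *name)
{
	for (size_t i = 0; i < item->nr; i++)
		if (strcmp(item->names[i], name) == 0)
			return 1;
	return 0;
}

/*
 * Collect the names found in the subvolume item but not in the log item.
 * 'out' must have room for subvol->nr entries; returns how many were stored.
 */
static size_t model_names_to_unlink(const struct model_ref_item *subvol,
				    const struct model_ref_item *log,
				    const char **out)
{
	size_t n = 0;

	assert(subvol != NULL && log != NULL && out != NULL);
	for (size_t i = 0; i < subvol->nr; i++)
		if (!model_find_name(log, subvol->names[i]))
			out[n++] = subvol->names[i];
	return n;
}
```

For the example in the commit message, the subvolume item holds "foo" and "bar" while the log item holds only "foo", so "bar" is the single name selected for unlinking.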
2007-03-27 00:00:06 +04:00
/* file-item.c */
2013-07-25 15:22:34 +04:00
struct btrfs_dio_private ;
2008-12-10 17:10:46 +03:00
int btrfs_del_csums ( struct btrfs_trans_handle * trans ,
2016-06-21 17:40:19 +03:00
struct btrfs_fs_info * fs_info , u64 bytenr , u64 len ) ;
2017-06-03 10:38:06 +03:00
blk_status_t btrfs_lookup_bio_sums ( struct inode * inode , struct bio * bio , u32 * dst ) ;
blk_status_t btrfs_lookup_bio_sums_dio ( struct inode * inode , struct bio * bio ,
2016-06-23 01:54:24 +03:00
u64 logical_offset ) ;
2007-04-17 21:26:50 +04:00
int btrfs_insert_file_extent ( struct btrfs_trans_handle * trans ,
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 21:49:59 +03:00
struct btrfs_root * root ,
u64 objectid , u64 pos ,
u64 disk_offset , u64 disk_num_bytes ,
u64 num_bytes , u64 offset , u64 ram_bytes ,
u8 compression , u8 encryption , u16 other_encoding ) ;
2007-03-27 00:00:06 +04:00
int btrfs_lookup_file_extent ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path , u64 objectid ,
2007-10-16 00:15:53 +04:00
u64 bytenr , int mod ) ;
2008-02-20 20:07:25 +03:00
int btrfs_csum_file_blocks ( struct btrfs_trans_handle * trans ,
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
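The keying scheme above could be sketched roughly as follows. This is an illustration only: `DEMO_CSUM_OBJECTID` and `DEMO_CSUM_KEY_TYPE` are hypothetical stand-ins for the magic values in ctree.h, and the in-item position helper assumes one fixed-size checksum per data block packed back to back, which this message does not spell out.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the fixed objectid and item type in ctree.h. */
#define DEMO_CSUM_OBJECTID ((uint64_t)-10)
#define DEMO_CSUM_KEY_TYPE 128u

struct demo_key {
	uint64_t objectid;  /* the same fixed value for every checksum item */
	uint8_t  type;
	uint64_t offset;    /* starting byte of the extent on disk */
};

/* Build the key under which checksums for the extent at bytenr are stored. */
static struct demo_key demo_csum_key(uint64_t bytenr)
{
	struct demo_key key = {
		.objectid = DEMO_CSUM_OBJECTID,
		.type = DEMO_CSUM_KEY_TYPE,
		.offset = bytenr,
	};
	return key;
}

/*
 * Byte position, inside an item whose key offset is item_start, of the
 * checksum covering the block at bytenr (assumed layout: one csum_size
 * entry per blocksize bytes of data).
 */
static uint64_t demo_csum_pos(uint64_t item_start, uint64_t bytenr,
			      uint32_t blocksize, uint32_t csum_size)
{
	return (bytenr - item_start) / blocksize * csum_size;
}
```

Because the key carries no inode or subvolume information, the same item can be copied into another tree unchanged, which is the property the log tree relies on.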
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-09 00:58:54 +03:00
struct btrfs_root * root ,
2008-07-17 20:53:50 +04:00
struct btrfs_ordered_sum * sums ) ;
2017-06-03 10:38:06 +03:00
blk_status_t btrfs_csum_one_bio ( struct inode * inode , struct bio * bio ,
2016-06-23 01:54:24 +03:00
u64 file_start , int contig ) ;
2011-03-08 16:14:00 +03:00
int btrfs_lookup_csums_range ( struct btrfs_root * root , u64 start , u64 end ,
struct list_head * list , int search_commit ) ;
2017-02-20 14:51:02 +03:00
void btrfs_extent_item_to_extent_map ( struct btrfs_inode * inode ,
2014-06-09 06:48:05 +04:00
const struct btrfs_path * path ,
struct btrfs_file_extent_item * fi ,
const bool new_inline ,
struct extent_map * em ) ;
2007-06-12 14:35:45 +04:00
/* inode.c */
2012-10-25 13:28:04 +04:00
struct btrfs_delalloc_work {
struct inode * inode ;
int delay_iput ;
struct completion completion ;
struct list_head list ;
struct btrfs_work work ;
} ;
struct btrfs_delalloc_work * btrfs_alloc_delalloc_work ( struct inode * inode ,
2015-11-27 21:24:16 +03:00
int delay_iput ) ;
2012-10-25 13:28:04 +04:00
void btrfs_wait_and_free_delalloc_work ( struct btrfs_delalloc_work * work ) ;
2017-02-20 14:51:06 +03:00
struct extent_map * btrfs_get_extent_fiemap ( struct btrfs_inode * inode ,
struct page * page , size_t pg_offset , u64 start ,
u64 len , int create ) ;
2013-08-14 22:02:47 +04:00
noinline int can_nocow_extent ( struct inode * inode , u64 offset , u64 * len ,
2013-06-22 00:37:03 +04:00
u64 * orig_start , u64 * orig_block_len ,
u64 * ram_bytes ) ;
2008-07-24 17:51:08 +04:00
2008-11-18 05:02:50 +03:00
struct inode * btrfs_lookup_dentry ( struct inode * dir , struct dentry * dentry ) ;
2017-02-20 14:50:35 +03:00
int btrfs_set_inode_index ( struct btrfs_inode * dir , u64 * index ) ;
2008-09-06 00:13:11 +04:00
int btrfs_unlink_inode ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
2017-01-18 01:31:44 +03:00
struct btrfs_inode * dir , struct btrfs_inode * inode ,
2008-09-06 00:13:11 +04:00
const char * name , int name_len ) ;
int btrfs_add_link ( struct btrfs_trans_handle * trans ,
2017-02-20 14:51:08 +03:00
struct btrfs_inode * parent_inode , struct btrfs_inode * inode ,
2008-09-06 00:13:11 +04:00
const char * name , int name_len , int add_backref , u64 index ) ;
2009-09-21 23:56:00 +04:00
int btrfs_unlink_subvol ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct inode * dir , u64 objectid ,
const char * name , int name_len ) ;
2016-01-21 13:25:56 +03:00
int btrfs_truncate_block ( struct inode * inode , loff_t from , loff_t len ,
2012-08-29 22:27:18 +04:00
int front ) ;
2008-09-06 00:13:11 +04:00
int btrfs_truncate_inode_items ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct inode * inode , u64 new_size ,
u32 min_type ) ;
2009-11-12 12:36:34 +03:00
int btrfs_start_delalloc_inodes ( struct btrfs_root * root , int delay_iput ) ;
2014-03-06 09:55:01 +04:00
int btrfs_start_delalloc_roots ( struct btrfs_fs_info * fs_info , int delay_iput ,
int nr ) ;
2010-02-03 22:33:23 +03:00
int btrfs_set_extent_delalloc ( struct inode * inode , u64 start , u64 end ,
2017-11-04 03:16:59 +03:00
unsigned int extra_bits ,
2016-07-19 11:50:36 +03:00
struct extent_state * * cached_state , int dedupe ) ;
2008-12-12 00:30:39 +03:00
int btrfs_create_subvol_root ( struct btrfs_trans_handle * trans ,
Btrfs: add support for inode properties
This change adds infrastructure to allow for generic properties for
inodes. Properties are name/value pairs that can be associated with
inodes for different purposes. They are stored as xattrs with the
prefix "btrfs."
Properties can be inherited - this means when a directory inode has
inheritable properties set, these are added to new inodes created
under that directory. Further, subvolumes can also have properties
associated with them, and they can be inherited from their parent
subvolume. Naturally, directory properties have priority over subvolume
properties (in practice a subvolume property is just a regular
property associated with the root inode, objectid 256, of the
subvolume's fs tree).
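The priority rule above can be modelled with a toy lookup. This is a sketch under stated assumptions, not the kernel implementation: `demo_prop`, `demo_inode` and both helpers are hypothetical, and properties are held in plain arrays rather than xattr items.

```c
#include <stddef.h>
#include <string.h>

struct demo_prop {
	const char *name;
	const char *value;
};

struct demo_inode {
	const struct demo_prop *props;
	size_t nr_props;
};

/* Return the value of the named property on this inode, or NULL. */
static const char *demo_find(const struct demo_inode *inode, const char *name)
{
	for (size_t i = 0; i < inode->nr_props; i++)
		if (strcmp(inode->props[i].name, name) == 0)
			return inode->props[i].value;
	return NULL;
}

/* Directory properties win; the subvolume root is only a fallback. */
static const char *demo_inherited(const struct demo_inode *dir,
				  const struct demo_inode *subvol_root,
				  const char *name)
{
	const char *value = demo_find(dir, name);

	return value ? value : demo_find(subvol_root, name);
}
```

With a directory that sets "btrfs.compression" to "zlib" inside a subvolume whose root sets it to "lzo", a new file created in that directory would pick up "zlib" in this model.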
This change also adds one specific property implementation, named
"compression", whose values can be "lzo" or "zlib" and it's an
inheritable property.
The corresponding changes to btrfs-progs were also implemented.
A patch with xfstests for this feature will follow once there's
agreement on this change/feature.
Further, the script at the bottom of this commit message was used to
do some benchmarks to measure any performance penalties of this feature.
Basically the tests correspond to:
Test 1 - create a filesystem and mount it with compress-force=lzo,
then sequentially create N files of 64KB each, measure how long it took
to create the files, unmount the filesystem, mount the filesystem and
perform an 'ls -lha' against the test directory holding the N files, and
report the time the command took.
Test 2 - create a filesystem and don't use any compression option when
mounting it - instead set the compression property of the subvolume's
root to 'lzo'. Then create N files of 64KB, and report the time it took.
Then unmount the filesystem, mount it again and perform an 'ls -lha' like
in the former test. This means every single file ends up with a property
(xattr) associated to it.
Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
compression property, have no real effect other than adding more work
when inheriting properties and taking more btree leaf space.
Test 4 - same as test 3 but with 10 properties per file.
Results (in seconds, and averages of 5 runs each), for different N
numbers of files follow.
* Without properties (test 1)
                    file creation time    ls -lha time
10 000 files                      3.49            0.76
100 000 files                    47.19            8.37
1 000 000 files                 518.51          107.06
* With 1 property (compression property set to lzo - test 2)
                    file creation time    ls -lha time
10 000 files                      3.63            0.93
100 000 files                    48.56            9.74
1 000 000 files                 537.72          125.11
* With 4 properties (test 3)
                    file creation time    ls -lha time
10 000 files                      3.94            1.20
100 000 files                    52.14           11.48
1 000 000 files                 572.70          142.13
* With 10 properties (test 4)
                    file creation time    ls -lha time
10 000 files                      4.61            1.35
100 000 files                    58.86           13.83
1 000 000 files                 656.01          177.61
The increased latencies with properties are essentially because of:
*) When creating an inode, we now synchronously write 1 more item
(an xattr item) for each property inherited from the parent dir
(or subvolume). This could be done in an asynchronous way such
as we do for dir index items (delayed-inode.c), which could help
reduce the file creation latency;
*) With properties, we now have larger fs trees. For this particular
test each xattr item uses 75 bytes of leaf space in the fs tree.
This could be less by using a new item for xattr items, instead of
the current btrfs_dir_item, since we could cut the 'location' and
'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
total of 26 bytes per xattr item) from the btrfs_dir_item type.
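The 75, 18 and 26 byte figures above can be reproduced with a small worked sketch. The structures below are simplified packed mirrors of the on-disk item header and btrfs_dir_item, written from memory for illustration (the `demo_*` names are hypothetical), and `__attribute__((packed))` assumes GCC or Clang.

```c
#include <stdint.h>
#include <string.h>

/* Simplified, packed mirrors of the on-disk layouts (assumed for illustration). */
struct demo_disk_key {
	uint64_t objectid;
	uint8_t  type;
	uint64_t offset;
} __attribute__((packed));              /* 17 bytes */

struct demo_item {                      /* per-item header stored in the leaf */
	struct demo_disk_key key;
	uint32_t offset;
	uint32_t size;
} __attribute__((packed));              /* 25 bytes */

struct demo_dir_item {
	struct demo_disk_key location;  /* candidate for removal, per the text */
	uint64_t transid;               /* candidate for removal, per the text */
	uint16_t data_len;
	uint16_t name_len;
	uint8_t  type;                  /* candidate for removal, per the text */
} __attribute__((packed));              /* 30 bytes */

/* Leaf space used by one xattr: item header + dir item + name + value. */
static size_t demo_xattr_leaf_bytes(const char *name, const char *value)
{
	return sizeof(struct demo_item) + sizeof(struct demo_dir_item) +
	       strlen(name) + strlen(value);
}
```

For the name "btrfs.compression" (17 bytes) and the value "lzo" (3 bytes) this gives 25 + 30 + 17 + 3 = 75 bytes. Dropping 'location' (17) and 'type' (1) saves 18 bytes, and dropping 'transid' (8) as well brings the saving to 26.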
Also tried batching the xattr insertions (ignoring proper hash
collision handling, since it didn't exist) when creating files that
inherit properties from their parent inode/subvolume, but the end
results were (surprisingly) essentially the same.
Test script:
$ cat test.pl
#!/usr/bin/perl -w
use strict;
use Time::HiRes qw(time);
use constant NUM_FILES => 10_000;
use constant FILE_SIZES => (64 * 1024);
use constant DEV => '/dev/sdb4';
use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
use constant TEST_DIR => (MNT_POINT . '/testdir');
system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";
# following line for testing without properties
#system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";
# following 2 lines for testing with properties
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";
system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
my ($t1, $t2);
$t1 = time();
for (my $i = 1; $i <= NUM_FILES; $i++) {
my $p = TEST_DIR . '/file_' . $i;
open(my $f, '>', $p) or die "Error opening file!";
$f->autoflush(1);
for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
print $f ('A' x 4096) or die "Error writing to file!";
}
close($f);
}
$t2 = time();
print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
$t1 = time();
system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
$t2 = time();
print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-07 15:47:46 +04:00
struct btrfs_root * new_root ,
struct btrfs_root * parent_root ,
u64 new_dirid ) ;
2016-06-05 22:31:54 +03:00
int btrfs_merge_bio_hook ( struct page * page , unsigned long offset ,
2009-07-16 02:29:37 +04:00
size_t size , struct bio * bio ,
unsigned long bio_flags ) ;
2017-05-05 18:57:13 +03:00
void btrfs_set_range_writeback ( void * private_data , u64 start , u64 end ) ;
2017-02-25 01:56:41 +03:00
int btrfs_page_mkwrite ( struct vm_fault * vmf ) ;
2007-06-15 21:50:00 +04:00
int btrfs_readpage ( struct file * file , struct page * page ) ;
2010-06-07 19:35:40 +04:00
void btrfs_evict_inode ( struct inode * inode ) ;
2010-03-05 11:21:37 +03:00
int btrfs_write_inode ( struct inode * inode , struct writeback_control * wbc ) ;
2007-06-12 14:35:45 +04:00
struct inode * btrfs_alloc_inode ( struct super_block * sb ) ;
void btrfs_destroy_inode ( struct inode * inode ) ;
2010-06-07 21:43:19 +04:00
int btrfs_drop_inode ( struct inode * inode ) ;
2017-11-03 02:21:50 +03:00
int __init btrfs_init_cachep ( void ) ;
2007-06-12 14:35:45 +04:00
void btrfs_destroy_cachep ( void ) ;
2008-06-10 18:07:39 +04:00
long btrfs_ioctl_trans_end ( struct file * file ) ;
2008-07-21 00:31:04 +04:00
struct inode * btrfs_iget ( struct super_block * s , struct btrfs_key * location ,
Btrfs: change how we mount subvolumes
This work is in preparation for being able to set a different root as the
default mounting root.
There is currently a problem with how we mount subvolumes. We cannot currently
mount a subvolume of a subvolume; you can only mount subvolumes/snapshots of the
default subvolume. So say you take a snapshot of the default subvolume and call
it snap1, and then take a snapshot of snap1 and call it snap2, so now you have
/
/snap1
/snap1/snap2
as your available volumes. Currently you can only mount / and /snap1,
you cannot mount /snap1/snap2. To fix this problem instead of passing
subvolid=<name> you must pass in subvolid=<treeid>, where <treeid> is
the tree id that gets spit out via the subvolume listing you get from
the subvolume listing patches (btrfs filesystem list). This allows us
to mount /, /snap1 and /snap1/snap2 as the root volume.
In addition to the above, we also now read the default dir item in the
tree root to get the root key that it points to. For now this just
points at what has always been the default subvolume, but later on I plan
to change it to point at whatever root you want to be the new default
root, so you can just set the default mount and not have to mount with
-o subvolid=<treeid>. I tested this out with the above scenario and it
worked perfectly. Thanks,
mount -o subvol operates inside the selected subvolid. For example:
mount -o subvol=snap1,subvolid=256 /dev/xxx /mnt
/mnt will have the snap1 directory for the subvolume with id
256.
mount -o subvol=snap /dev/xxx /mnt
/mnt will be the snap directory of whatever the default subvolume
is.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-04 20:38:27 +03:00
struct btrfs_root * root , int * was_new ) ;
2017-02-20 14:51:06 +03:00
struct extent_map * btrfs_get_extent ( struct btrfs_inode * inode ,
struct page * page , size_t pg_offset ,
u64 start , u64 end , int create ) ;
2007-08-28 00:49:44 +04:00
int btrfs_update_inode ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct inode * inode ) ;
2012-10-22 23:43:12 +04:00
int btrfs_update_inode_fallback ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root , struct inode * inode ) ;
2017-02-20 14:50:59 +03:00
int btrfs_orphan_add ( struct btrfs_trans_handle * trans ,
struct btrfs_inode * inode ) ;
2011-02-01 00:22:42 +03:00
int btrfs_orphan_cleanup ( struct btrfs_root * root ) ;
2010-05-16 18:49:58 +04:00
void btrfs_orphan_commit_root ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ) ;
2011-01-31 23:30:16 +03:00
int btrfs_cont_expand ( struct inode * inode , loff_t oldsize , loff_t size ) ;
2012-03-01 17:56:26 +04:00
void btrfs_invalidate_inodes ( struct btrfs_root * root ) ;
2009-11-12 12:36:34 +03:00
void btrfs_add_delayed_iput ( struct inode * inode ) ;
2016-06-23 01:54:24 +03:00
void btrfs_run_delayed_iputs ( struct btrfs_fs_info * fs_info ) ;
2010-05-16 18:49:59 +04:00
int btrfs_prealloc_file_range ( struct inode * inode , int mode ,
u64 start , u64 num_bytes , u64 min_size ,
loff_t actual_len , u64 * alloc_hint ) ;
2010-06-21 22:48:16 +04:00
int btrfs_prealloc_file_range_trans ( struct inode * inode ,
struct btrfs_trans_handle * trans , int mode ,
u64 start , u64 num_bytes , u64 min_size ,
loff_t actual_len , u64 * alloc_hint ) ;
2009-10-09 17:54:36 +04:00
extern const struct dentry_operations btrfs_dentry_operations ;
2015-03-17 00:38:52 +03:00
# ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
void btrfs_test_inode_set_ops ( struct inode * inode ) ;
# endif
2008-06-12 05:53:53 +04:00
/* ioctl.c */
long btrfs_ioctl ( struct file * file , unsigned int cmd , unsigned long arg ) ;
2015-10-29 11:22:21 +03:00
long btrfs_compat_ioctl ( struct file * file , unsigned int cmd , unsigned long arg ) ;
2016-02-17 17:26:27 +03:00
int btrfs_ioctl_get_supported_features ( void __user * arg ) ;
2009-04-17 12:37:41 +04:00
void btrfs_update_iflags ( struct inode * inode ) ;
2013-08-15 19:11:20 +04:00
int btrfs_is_empty_uuid ( u8 * uuid ) ;
2011-05-24 23:35:30 +04:00
int btrfs_defrag_file ( struct inode * inode , struct file * file ,
struct btrfs_ioctl_defrag_range_args * range ,
u64 newer_than , unsigned long max_pages ) ;
2012-08-01 20:56:49 +04:00
void btrfs_get_block_group_info ( struct list_head * groups_list ,
struct btrfs_ioctl_space_info * space ) ;
2013-08-14 20:12:25 +04:00
void update_ioctl_balance_args ( struct btrfs_fs_info * fs_info , int lock ,
struct btrfs_ioctl_balance_args * bargs ) ;
2015-12-19 11:56:05 +03:00
ssize_t btrfs_dedupe_file_range ( struct file * src_file , u64 loff , u64 olen ,
struct file * dst_file , u64 dst_loff ) ;
2013-08-14 20:12:25 +04:00
2007-06-12 14:35:45 +04:00
/* file.c */
2017-11-03 02:21:50 +03:00
int __init btrfs_auto_defrag_init ( void ) ;
2012-11-26 13:24:43 +04:00
void btrfs_auto_defrag_exit ( void ) ;
2011-05-24 23:35:30 +04:00
int btrfs_add_inode_defrag ( struct btrfs_trans_handle * trans ,
2017-02-20 14:50:43 +03:00
struct btrfs_inode * inode ) ;
2011-05-24 23:35:30 +04:00
int btrfs_run_defrag_inodes ( struct btrfs_fs_info * fs_info ) ;
2012-11-26 13:26:20 +04:00
void btrfs_cleanup_defrag_inodes ( struct btrfs_fs_info * fs_info ) ;
2011-07-17 04:44:56 +04:00
int btrfs_sync_file ( struct file * file , loff_t start , loff_t end , int datasync ) ;
2017-02-20 14:50:45 +03:00
void btrfs_drop_extent_cache ( struct btrfs_inode * inode , u64 start , u64 end ,
2012-08-31 04:06:49 +04:00
int skip_pinned ) ;
2009-10-02 02:43:56 +04:00
extern const struct file_operations btrfs_file_operations ;
Btrfs: turbo charge fsync
At least for the vm workload. Currently on fsync we will
1) Truncate all items in the log tree for the given inode if they exist
and
2) Copy all items for a given inode into the log
The problem with this is that for things like VMs you can have lots of
extents from the fragmented writing behavior, and worse yet you may have
only modified a few extents, not the entire thing. This patch fixes this
problem by tracking which transid modified our extent, and then when we do
the tree logging we find all of the extents we've modified in our current
transaction, sort them and commit them. We also only truncate up to the
xattrs of the inode and copy that stuff in normally, and then just drop any
extents in the range we have that exist in the log already. Here are some
numbers of a 50 meg fio job that does random writes and fsync()s after every
write
                Original    Patched
SATA drive        82KB/s    140KB/s
Fusion drive     431KB/s   2532KB/s
So around 2-6 times faster depending on your hardware. There are a few
corner cases, for example if you truncate at all we have to do it the old
way since there is no way to be sure what is in the log is ok. This
probably could be done smarter, but if you write-fsync-truncate-write-fsync
you deserve what you get. All this work is in RAM of course so if your
inode gets evicted from cache and you read it in and fsync it we'll do it
the slow way if we are still in the same transaction that we last modified
the inode in.
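The fast path described above (pick out the extents modified in the current transaction, sort them, log only those) might look roughly like this in userspace. It is a sketch, not the tree-log code: `demo_extent` and `demo_collect_modified` are hypothetical, and extents live in a plain array instead of the extent map tree.

```c
#include <stdint.h>
#include <stdlib.h>

struct demo_extent {
	uint64_t start;       /* file offset */
	uint64_t len;
	uint64_t generation;  /* transid that last modified this extent */
};

/* Order extents by file offset without risking integer overflow. */
static int demo_cmp_start(const void *a, const void *b)
{
	const struct demo_extent *x = a, *y = b;

	return (x->start > y->start) - (x->start < y->start);
}

/*
 * Copy the extents modified in transaction transid into out (which must
 * have room for n entries), sorted by start. Returns how many were copied.
 */
static size_t demo_collect_modified(const struct demo_extent *all, size_t n,
				    uint64_t transid, struct demo_extent *out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (all[i].generation == transid)
			out[count++] = all[i];
	qsort(out, count, sizeof(*out), demo_cmp_start);
	return count;
}
```

For a fragmented VM image with thousands of extents and only a handful touched since the last commit, the amount of work per fsync then scales with the handful rather than with the whole file.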
The biggest cool part of this is that it requires no changes to the recovery
code, so if you fsync with this patch and crash and load an old kernel, it
will run the recovery and be a-ok. I have tested this pretty thoroughly
with an fsync tester and everything comes back fine, as well as xfstests.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-17 21:14:17 +04:00
int __btrfs_drop_extents ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root , struct inode * inode ,
struct btrfs_path * path , u64 start , u64 end ,
2014-01-07 15:42:27 +04:00
u64 * drop_end , int drop_cache ,
int replace_extent ,
u32 extent_item_size ,
int * key_inserted ) ;
2012-08-17 21:14:17 +04:00
int btrfs_drop_extents ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root , struct inode * inode , u64 start ,
2012-08-29 20:24:27 +04:00
u64 end , int drop_cache ) ;
2008-10-30 21:25:28 +03:00
int btrfs_mark_extent_written ( struct btrfs_trans_handle * trans ,
2017-02-20 14:50:48 +03:00
struct btrfs_inode * inode , u64 start , u64 end ) ;
2008-06-10 18:07:39 +04:00
int btrfs_release_file ( struct inode * inode , struct file * file ) ;
2016-06-23 01:54:24 +03:00
int btrfs_dirty_pages ( struct inode * inode , struct page * * pages ,
size_t num_pages , loff_t pos , size_t write_bytes ,
2011-04-06 21:05:22 +04:00
struct extent_state * * cached ) ;
2014-10-10 12:43:11 +04:00
int btrfs_fdatawrite_range ( struct inode * inode , loff_t start , loff_t end ) ;
2015-12-03 14:59:50 +03:00
int btrfs_clone_file_range ( struct file * file_in , loff_t pos_in ,
struct file * file_out , loff_t pos_out , u64 len ) ;
2008-06-10 18:07:39 +04:00
2007-08-08 00:15:09 +04:00
/* tree-defrag.c */
int btrfs_defrag_leaves ( struct btrfs_trans_handle * trans ,
2013-01-31 22:21:12 +04:00
struct btrfs_root * root ) ;
2007-08-29 23:47:34 +04:00
/* sysfs.c */
2017-11-03 02:21:50 +03:00
int __init btrfs_init_sysfs ( void ) ;
2007-08-29 23:47:34 +04:00
void btrfs_exit_sysfs ( void ) ;
2015-08-14 13:32:46 +03:00
int btrfs_sysfs_add_mounted ( struct btrfs_fs_info * fs_info ) ;
2015-08-14 13:32:47 +03:00
void btrfs_sysfs_remove_mounted ( struct btrfs_fs_info * fs_info ) ;
2007-08-29 23:47:34 +04:00
2007-11-16 19:45:54 +03:00
/* xattr.c */
ssize_t btrfs_listxattr ( struct dentry * dentry , char * buffer , size_t size ) ;
2008-07-24 20:16:03 +04:00
2007-12-22 00:27:24 +03:00
/* super.c */
2016-06-23 01:54:24 +03:00
int btrfs_parse_options ( struct btrfs_fs_info * info , char * options ,
2016-01-19 05:23:03 +03:00
unsigned long new_flags ) ;
2008-06-10 18:07:39 +04:00
int btrfs_sync_fs ( struct super_block * sb , int wait ) ;
2012-07-31 01:40:13 +04:00
2016-09-23 19:05:21 +03:00
static inline __printf ( 2 , 3 )
void btrfs_no_printk ( const struct btrfs_fs_info * fs_info , const char * fmt , . . . )
{
}
2012-07-31 01:40:13 +04:00
# ifdef CONFIG_PRINTK
__printf ( 2 , 3 )
2013-03-20 02:41:23 +04:00
void btrfs_printk ( const struct btrfs_fs_info * fs_info , const char * fmt , . . . ) ;
2012-07-31 01:40:13 +04:00
# else
2016-09-23 19:05:21 +03:00
# define btrfs_printk(fs_info, fmt, args...) \
btrfs_no_printk ( fs_info , fmt , # # args )
2012-07-31 01:40:13 +04:00
# endif
2013-03-20 02:41:23 +04:00
# define btrfs_emerg(fs_info, fmt, args...) \
btrfs_printk ( fs_info , KERN_EMERG fmt , # # args )
# define btrfs_alert(fs_info, fmt, args...) \
btrfs_printk ( fs_info , KERN_ALERT fmt , # # args )
# define btrfs_crit(fs_info, fmt, args...) \
btrfs_printk ( fs_info , KERN_CRIT fmt , # # args )
# define btrfs_err(fs_info, fmt, args...) \
btrfs_printk ( fs_info , KERN_ERR fmt , # # args )
# define btrfs_warn(fs_info, fmt, args...) \
btrfs_printk ( fs_info , KERN_WARNING fmt , # # args )
# define btrfs_notice(fs_info, fmt, args...) \
btrfs_printk ( fs_info , KERN_NOTICE fmt , # # args )
# define btrfs_info(fs_info, fmt, args...) \
btrfs_printk ( fs_info , KERN_INFO fmt , # # args )
2013-11-13 04:22:53 +04:00
2015-10-08 09:48:52 +03:00
/*
* Wrappers that use printk_in_rcu
*/
# define btrfs_emerg_in_rcu(fs_info, fmt, args...) \
btrfs_printk_in_rcu ( fs_info , KERN_EMERG fmt , # # args )
# define btrfs_alert_in_rcu(fs_info, fmt, args...) \
btrfs_printk_in_rcu ( fs_info , KERN_ALERT fmt , # # args )
# define btrfs_crit_in_rcu(fs_info, fmt, args...) \
btrfs_printk_in_rcu ( fs_info , KERN_CRIT fmt , # # args )
# define btrfs_err_in_rcu(fs_info, fmt, args...) \
btrfs_printk_in_rcu ( fs_info , KERN_ERR fmt , # # args )
# define btrfs_warn_in_rcu(fs_info, fmt, args...) \
btrfs_printk_in_rcu ( fs_info , KERN_WARNING fmt , # # args )
# define btrfs_notice_in_rcu(fs_info, fmt, args...) \
btrfs_printk_in_rcu ( fs_info , KERN_NOTICE fmt , # # args )
# define btrfs_info_in_rcu(fs_info, fmt, args...) \
btrfs_printk_in_rcu ( fs_info , KERN_INFO fmt , # # args )
2015-10-08 11:27:02 +03:00
/*
* Wrappers that use a ratelimited printk_in_rcu
*/
# define btrfs_emerg_rl_in_rcu(fs_info, fmt, args...) \
btrfs_printk_rl_in_rcu ( fs_info , KERN_EMERG fmt , # # args )
# define btrfs_alert_rl_in_rcu(fs_info, fmt, args...) \
btrfs_printk_rl_in_rcu ( fs_info , KERN_ALERT fmt , # # args )
# define btrfs_crit_rl_in_rcu(fs_info, fmt, args...) \
btrfs_printk_rl_in_rcu ( fs_info , KERN_CRIT fmt , # # args )
# define btrfs_err_rl_in_rcu(fs_info, fmt, args...) \
btrfs_printk_rl_in_rcu ( fs_info , KERN_ERR fmt , # # args )
# define btrfs_warn_rl_in_rcu(fs_info, fmt, args...) \
btrfs_printk_rl_in_rcu ( fs_info , KERN_WARNING fmt , # # args )
# define btrfs_notice_rl_in_rcu(fs_info, fmt, args...) \
btrfs_printk_rl_in_rcu ( fs_info , KERN_NOTICE fmt , # # args )
# define btrfs_info_rl_in_rcu(fs_info, fmt, args...) \
btrfs_printk_rl_in_rcu ( fs_info , KERN_INFO fmt , # # args )
2015-10-08 11:51:11 +03:00
/*
* Wrappers that use a ratelimited printk
*/
# define btrfs_emerg_rl(fs_info, fmt, args...) \
btrfs_printk_ratelimited ( fs_info , KERN_EMERG fmt , # # args )
# define btrfs_alert_rl(fs_info, fmt, args...) \
btrfs_printk_ratelimited ( fs_info , KERN_ALERT fmt , # # args )
# define btrfs_crit_rl(fs_info, fmt, args...) \
btrfs_printk_ratelimited ( fs_info , KERN_CRIT fmt , # # args )
# define btrfs_err_rl(fs_info, fmt, args...) \
btrfs_printk_ratelimited ( fs_info , KERN_ERR fmt , # # args )
# define btrfs_warn_rl(fs_info, fmt, args...) \
btrfs_printk_ratelimited ( fs_info , KERN_WARNING fmt , # # args )
# define btrfs_notice_rl(fs_info, fmt, args...) \
btrfs_printk_ratelimited ( fs_info , KERN_NOTICE fmt , # # args )
# define btrfs_info_rl(fs_info, fmt, args...) \
btrfs_printk_ratelimited ( fs_info , KERN_INFO fmt , # # args )
2016-09-01 06:55:33 +03:00
# if defined(CONFIG_DYNAMIC_DEBUG)
# define btrfs_debug(fs_info, fmt, args...) \
do { \
DEFINE_DYNAMIC_DEBUG_METADATA ( descriptor , fmt ) ; \
if ( unlikely ( descriptor . flags & _DPRINTK_FLAGS_PRINT ) ) \
btrfs_printk ( fs_info , KERN_DEBUG fmt , # # args ) ; \
} while ( 0 )
# define btrfs_debug_in_rcu(fs_info, fmt, args...) \
do { \
DEFINE_DYNAMIC_DEBUG_METADATA ( descriptor , fmt ) ; \
if ( unlikely ( descriptor . flags & _DPRINTK_FLAGS_PRINT ) ) \
btrfs_printk_in_rcu ( fs_info , KERN_DEBUG fmt , # # args ) ; \
} while ( 0 )
# define btrfs_debug_rl_in_rcu(fs_info, fmt, args...) \
do { \
DEFINE_DYNAMIC_DEBUG_METADATA ( descriptor , fmt ) ; \
if ( unlikely ( descriptor . flags & _DPRINTK_FLAGS_PRINT ) ) \
btrfs_printk_rl_in_rcu ( fs_info , KERN_DEBUG fmt , \
# # args ) ; \
} while ( 0 )
# define btrfs_debug_rl(fs_info, fmt, args...) \
do { \
DEFINE_DYNAMIC_DEBUG_METADATA ( descriptor , fmt ) ; \
if ( unlikely ( descriptor . flags & _DPRINTK_FLAGS_PRINT ) ) \
btrfs_printk_ratelimited ( fs_info , KERN_DEBUG fmt , \
# # args ) ; \
} while ( 0 )
# elif defined(DEBUG)
2013-03-20 02:41:23 +04:00
# define btrfs_debug(fs_info, fmt, args...) \
btrfs_printk ( fs_info , KERN_DEBUG fmt , # # args )
2015-10-08 09:48:52 +03:00
# define btrfs_debug_in_rcu(fs_info, fmt, args...) \
btrfs_printk_in_rcu ( fs_info , KERN_DEBUG fmt , # # args )
2015-10-08 11:27:02 +03:00
# define btrfs_debug_rl_in_rcu(fs_info, fmt, args...) \
btrfs_printk_rl_in_rcu ( fs_info , KERN_DEBUG fmt , # # args )
2015-10-08 11:51:11 +03:00
# define btrfs_debug_rl(fs_info, fmt, args...) \
btrfs_printk_ratelimited ( fs_info , KERN_DEBUG fmt , # # args )
2013-11-13 04:22:53 +04:00
# else
# define btrfs_debug(fs_info, fmt, args...) \
2016-09-21 19:17:37 +03:00
btrfs_no_printk ( fs_info , KERN_DEBUG fmt , # # args )
2015-10-08 09:48:52 +03:00
# define btrfs_debug_in_rcu(fs_info, fmt, args...) \
2016-09-21 19:17:37 +03:00
btrfs_no_printk ( fs_info , KERN_DEBUG fmt , # # args )
2015-10-08 11:27:02 +03:00
# define btrfs_debug_rl_in_rcu(fs_info, fmt, args...) \
2016-09-21 19:17:37 +03:00
btrfs_no_printk ( fs_info , KERN_DEBUG fmt , # # args )
2015-10-08 11:51:11 +03:00
# define btrfs_debug_rl(fs_info, fmt, args...) \
2016-09-21 19:17:37 +03:00
btrfs_no_printk ( fs_info , KERN_DEBUG fmt , # # args )
2013-11-13 04:22:53 +04:00
# endif
2013-03-20 02:41:23 +04:00
2015-10-08 09:48:52 +03:00
# define btrfs_printk_in_rcu(fs_info, fmt, args...) \
do { \
rcu_read_lock ( ) ; \
btrfs_printk ( fs_info , fmt , # # args ) ; \
rcu_read_unlock ( ) ; \
} while ( 0 )
2015-10-08 11:27:02 +03:00
# define btrfs_printk_ratelimited(fs_info, fmt, args...) \
do { \
static DEFINE_RATELIMIT_STATE ( _rs , \
DEFAULT_RATELIMIT_INTERVAL , \
DEFAULT_RATELIMIT_BURST ) ; \
if ( __ratelimit ( & _rs ) ) \
btrfs_printk ( fs_info , fmt , # # args ) ; \
} while ( 0 )
# define btrfs_printk_rl_in_rcu(fs_info, fmt, args...) \
do { \
rcu_read_lock ( ) ; \
btrfs_printk_ratelimited ( fs_info , fmt , # # args ) ; \
rcu_read_unlock ( ) ; \
} while ( 0 )
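The rate limiting used by the macros above can be approximated in userspace as a fixed window with a burst budget. This is a simplified sketch, not the kernel's __ratelimit(): `demo_ratelimit` and `demo_ratelimit_ok` are hypothetical names, time is an abstract tick counter supplied by the caller, and `now` is assumed never to go backwards.

```c
#include <stdint.h>

struct demo_ratelimit {
	uint64_t interval;  /* window length, in arbitrary ticks */
	unsigned burst;     /* messages allowed per window */
	uint64_t begin;     /* start of the current window */
	unsigned printed;   /* messages emitted in the current window */
};

/* Returns 1 if the caller may print at time now, 0 if the message is dropped. */
static int demo_ratelimit_ok(struct demo_ratelimit *rs, uint64_t now)
{
	if (now - rs->begin >= rs->interval) {
		/* Window expired: start a new one with a fresh budget. */
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed >= rs->burst)
		return 0;
	rs->printed++;
	return 1;
}
```

Because the state in btrfs_printk_ratelimited is declared static inside the macro body, every call site gets its own budget; a noisy message in one function does not silence messages elsewhere.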
2013-08-27 00:53:15 +04:00
# ifdef CONFIG_BTRFS_ASSERT
2015-04-24 20:11:57 +03:00
__cold
2013-08-27 00:53:15 +04:00
static inline void assfail ( char * expr , char * file , int line )
{
2016-09-20 17:05:02 +03:00
pr_err ( " assertion failed: %s, file: %s, line: %d \n " ,
2013-08-27 00:53:15 +04:00
expr , file , line ) ;
BUG ( ) ;
}
# define ASSERT(expr) \
( likely ( expr ) ? ( void ) 0 : assfail ( # expr , __FILE__ , __LINE__ ) )
# else
# define ASSERT(expr) ((void)0)
# endif
2012-07-31 01:40:13 +04:00
__printf ( 5 , 6 )
2015-04-24 20:11:57 +03:00
__cold
2016-03-16 11:43:06 +03:00
void __btrfs_handle_fs_error ( struct btrfs_fs_info * fs_info , const char * function ,
2012-03-01 17:57:30 +04:00
unsigned int line , int errno , const char * fmt , . . . ) ;
2011-01-06 14:30:25 +03:00
2015-06-15 16:41:19 +03:00
const char * btrfs_decode_error ( int errno ) ;
2012-07-31 01:40:13 +04:00
2015-04-24 20:11:57 +03:00
__cold
2012-03-01 20:24:58 +04:00
void __btrfs_abort_transaction ( struct btrfs_trans_handle * trans ,
2016-06-11 01:19:25 +03:00
const char * function ,
2012-03-01 20:24:58 +04:00
unsigned int line , int errno ) ;
2016-03-16 11:43:08 +03:00
/*
* Call btrfs_abort_transaction as early as possible when an error condition is
* detected , that way the exact line number is reported .
*/
2016-06-11 01:19:25 +03:00
# define btrfs_abort_transaction(trans, errno) \
2016-03-16 11:43:08 +03:00
do { \
/* Report first abort since mount */ \
if ( ! test_and_set_bit ( BTRFS_FS_STATE_TRANS_ABORTED , \
2016-06-11 01:19:25 +03:00
& ( ( trans ) - > fs_info - > fs_state ) ) ) { \
2016-12-09 16:56:33 +03:00
if ( ( errno ) ! = - EIO ) { \
WARN ( 1 , KERN_DEBUG \
" BTRFS: Transaction aborted (error %d) \n " , \
( errno ) ) ; \
} else { \
2017-02-16 00:28:34 +03:00
btrfs_debug ( ( trans ) - > fs_info , \
" Transaction aborted (error %d) " , \
2016-12-09 16:56:33 +03:00
( errno ) ) ; \
} \
2016-03-16 11:43:08 +03:00
} \
2016-06-11 01:19:25 +03:00
__btrfs_abort_transaction ( ( trans ) , __func__ , \
2016-03-16 11:43:08 +03:00
__LINE__ , ( errno ) ) ; \
} while ( 0 )
# define btrfs_handle_fs_error(fs_info, errno, fmt, args...) \
do { \
__btrfs_handle_fs_error ( ( fs_info ) , __func__ , __LINE__ , \
( errno ) , fmt , # # args ) ; \
} while ( 0 )
__printf ( 5 , 6 )
__cold
void __btrfs_panic ( struct btrfs_fs_info * fs_info , const char * function ,
unsigned int line , int errno , const char * fmt , . . . ) ;
/*
* If BTRFS_MOUNT_PANIC_ON_FATAL_ERROR is in mount_opt , __btrfs_panic
* will panic ( ) . Otherwise we BUG ( ) here .
*/
# define btrfs_panic(fs_info, errno, fmt, args...) \
do { \
__btrfs_panic ( fs_info , __func__ , __LINE__ , errno , fmt , # # args ) ; \
BUG ( ) ; \
} while ( 0 )
/* compatibility and incompatibility defines */
2012-07-24 21:58:43 +04:00
# define btrfs_set_fs_incompat(__fs_info, opt) \
__btrfs_set_fs_incompat ( ( __fs_info ) , BTRFS_FEATURE_INCOMPAT_ # # opt )
static inline void __btrfs_set_fs_incompat ( struct btrfs_fs_info * fs_info ,
u64 flag )
{
struct btrfs_super_block * disk_super ;
u64 features ;
disk_super = fs_info - > super_copy ;
features = btrfs_super_incompat_flags ( disk_super ) ;
if ( ! ( features & flag ) ) {
2013-04-11 14:30:16 +04:00
spin_lock ( & fs_info - > super_lock ) ;
features = btrfs_super_incompat_flags ( disk_super ) ;
if ( ! ( features & flag ) ) {
features | = flag ;
btrfs_set_super_incompat_flags ( disk_super , features ) ;
2013-12-20 20:37:06 +04:00
btrfs_info ( fs_info , " setting %llu feature flag " ,
2013-04-11 14:30:16 +04:00
flag ) ;
}
spin_unlock ( & fs_info - > super_lock ) ;
2012-07-24 21:58:43 +04:00
}
}
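The setter above follows a check, lock, recheck pattern: the flag word is tested without the lock first, and tested again under super_lock before it is modified. A minimal userspace sketch of that pattern is given below, assuming a pthread mutex in place of the spinlock and a C11 atomic in place of the superblock accessors; `struct demo_super`, `demo_set_incompat` and the `writes` counter are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the superblock copy plus fs_info->super_lock. */
struct demo_super {
	pthread_mutex_t lock;
	_Atomic uint64_t incompat_flags;
	int writes;	/* how many times the flag word was really modified */
};

static void demo_set_incompat(struct demo_super *sb, uint64_t flag)
{
	/* Fast path: skip the lock entirely when the bit is already set. */
	if (atomic_load(&sb->incompat_flags) & flag)
		return;

	pthread_mutex_lock(&sb->lock);
	/* Recheck under the lock: another thread may have set the bit. */
	if (!(atomic_load(&sb->incompat_flags) & flag)) {
		atomic_fetch_or(&sb->incompat_flags, flag);
		sb->writes++;
	}
	pthread_mutex_unlock(&sb->lock);
}

static void demo_set_incompat_selfcheck(void)
{
	struct demo_super sb = { .lock = PTHREAD_MUTEX_INITIALIZER };
	const uint64_t flag = 1ULL << 3;

	demo_set_incompat(&sb, flag);
	demo_set_incompat(&sb, flag);	/* second call takes the fast path */

	assert(atomic_load(&sb.incompat_flags) == flag);
	assert(sb.writes == 1);
}
```

The unlocked first test keeps the common case (flag already set) cheap, while the locked recheck ensures the update and its log message happen at most once.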
2015-09-30 06:50:32 +03:00
# define btrfs_clear_fs_incompat(__fs_info, opt) \
__btrfs_clear_fs_incompat ( ( __fs_info ) , BTRFS_FEATURE_INCOMPAT_ # # opt )
static inline void __btrfs_clear_fs_incompat ( struct btrfs_fs_info * fs_info ,
u64 flag )
{
struct btrfs_super_block * disk_super ;
u64 features ;
disk_super = fs_info - > super_copy ;
features = btrfs_super_incompat_flags ( disk_super ) ;
if ( features & flag ) {
spin_lock ( & fs_info - > super_lock ) ;
features = btrfs_super_incompat_flags ( disk_super ) ;
if ( features & flag ) {
features & = ~ flag ;
btrfs_set_super_incompat_flags ( disk_super , features ) ;
btrfs_info ( fs_info , " clearing %llu feature flag " ,
flag ) ;
}
spin_unlock ( & fs_info - > super_lock ) ;
}
}
2013-03-07 23:22:04 +04:00
# define btrfs_fs_incompat(fs_info, opt) \
__btrfs_fs_incompat ( ( fs_info ) , BTRFS_FEATURE_INCOMPAT_ # # opt )
2015-10-19 00:35:41 +03:00
static inline bool __btrfs_fs_incompat ( struct btrfs_fs_info * fs_info , u64 flag )
2013-03-07 23:22:04 +04:00
{
struct btrfs_super_block * disk_super ;
disk_super = fs_info - > super_copy ;
return ! ! ( btrfs_super_incompat_flags ( disk_super ) & flag ) ;
}
2015-09-30 06:50:32 +03:00
# define btrfs_set_fs_compat_ro(__fs_info, opt) \
__btrfs_set_fs_compat_ro ( ( __fs_info ) , BTRFS_FEATURE_COMPAT_RO_ # # opt )
static inline void __btrfs_set_fs_compat_ro ( struct btrfs_fs_info * fs_info ,
u64 flag )
{
struct btrfs_super_block * disk_super ;
u64 features ;
disk_super = fs_info - > super_copy ;
features = btrfs_super_compat_ro_flags ( disk_super ) ;
if ( ! ( features & flag ) ) {
spin_lock ( & fs_info - > super_lock ) ;
features = btrfs_super_compat_ro_flags ( disk_super ) ;
if ( ! ( features & flag ) ) {
features | = flag ;
btrfs_set_super_compat_ro_flags ( disk_super , features ) ;
btrfs_info ( fs_info , " setting %llu ro feature flag " ,
flag ) ;
}
spin_unlock ( & fs_info - > super_lock ) ;
}
}
# define btrfs_clear_fs_compat_ro(__fs_info, opt) \
__btrfs_clear_fs_compat_ro ( ( __fs_info ) , BTRFS_FEATURE_COMPAT_RO_ # # opt )
static inline void __btrfs_clear_fs_compat_ro ( struct btrfs_fs_info * fs_info ,
u64 flag )
{
struct btrfs_super_block * disk_super ;
u64 features ;
disk_super = fs_info - > super_copy ;
features = btrfs_super_compat_ro_flags ( disk_super ) ;
if ( features & flag ) {
spin_lock ( & fs_info - > super_lock ) ;
features = btrfs_super_compat_ro_flags ( disk_super ) ;
if ( features & flag ) {
features & = ~ flag ;
btrfs_set_super_compat_ro_flags ( disk_super , features ) ;
btrfs_info ( fs_info , " clearing %llu ro feature flag " ,
flag ) ;
}
spin_unlock ( & fs_info - > super_lock ) ;
}
}
# define btrfs_fs_compat_ro(fs_info, opt) \
__btrfs_fs_compat_ro ( ( fs_info ) , BTRFS_FEATURE_COMPAT_RO_ # # opt )
static inline int __btrfs_fs_compat_ro ( struct btrfs_fs_info * fs_info , u64 flag )
{
struct btrfs_super_block * disk_super ;
disk_super = fs_info - > super_copy ;
return ! ! ( btrfs_super_compat_ro_flags ( disk_super ) & flag ) ;
}
2008-07-24 20:16:36 +04:00
/* acl.c */
2009-10-13 21:50:18 +04:00
# ifdef CONFIG_BTRFS_FS_POSIX_ACL
2011-07-23 19:37:31 +04:00
struct posix_acl * btrfs_get_acl ( struct inode * inode , int type ) ;
2013-12-20 17:16:43 +04:00
int btrfs_set_acl ( struct inode * inode , struct posix_acl * acl , int type ) ;
2009-11-12 12:35:27 +03:00
int btrfs_init_acl ( struct btrfs_trans_handle * trans ,
struct inode * inode , struct inode * dir ) ;
2011-07-14 07:17:39 +04:00
# else
2011-08-03 11:14:05 +04:00
# define btrfs_get_acl NULL
2013-12-20 17:16:43 +04:00
# define btrfs_set_acl NULL
2011-07-14 07:17:39 +04:00
static inline int btrfs_init_acl ( struct btrfs_trans_handle * trans ,
struct inode * inode , struct inode * dir )
{
return 0 ;
}
# endif
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) removed the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree that tracks the block group cache via offset. Also
added a per space_info list that tracks the block group cache for the particular
space so we can look up related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. There were some issues in volumes.c where we wouldn't allocate
the rest of the disk. Fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so every time we did a data allocation we'd search
through every block group that we have looking for free space. Although searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degradation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and to give a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
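The two-step lookup described in point 1 of the message above can be sketched as follows. This is a simplified illustration, not the btrfs implementation: a single offset-sorted array stands in for the two rb-trees, and the names `struct demo_free_extent` and `demo_find_free` are hypothetical. Step 1 takes the first large-enough extent at or after the hint; step 2 falls back to the smallest large-enough extent, breaking ties by distance from the hint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-space record: one contiguous free range. */
struct demo_free_extent {
	uint64_t offset;
	uint64_t bytes;
};

static uint64_t demo_distance(uint64_t a, uint64_t b)
{
	return a > b ? a - b : b - a;
}

/* Returns the index of the chosen extent in ext[] (sorted by offset), or -1. */
static int demo_find_free(const struct demo_free_extent *ext, size_t n,
			  uint64_t hint, uint64_t want)
{
	int best = -1;
	size_t i;

	/* Step 1: offset search, first fit at or after the hint. */
	for (i = 0; i < n; i++)
		if (ext[i].offset >= hint && ext[i].bytes >= want)
			return (int)i;

	/* Step 2: size search, best fit; ties go to the extent nearest the hint. */
	for (i = 0; i < n; i++) {
		if (ext[i].bytes < want)
			continue;
		if (best < 0 || ext[i].bytes < ext[best].bytes ||
		    (ext[i].bytes == ext[best].bytes &&
		     demo_distance(ext[i].offset, hint) <
		     demo_distance(ext[best].offset, hint)))
			best = (int)i;
	}
	return best;
}

static void demo_find_free_selfcheck(void)
{
	static const struct demo_free_extent ext[] = {
		{ 0,      4096 },
		{ 8192,   65536 },
		{ 131072, 8192 },
		{ 262144, 1048576 },
	};

	assert(demo_find_free(ext, 4, 100000, 4096) == 2);	/* found after hint */
	assert(demo_find_free(ext, 4, 300000, 4096) == 0);	/* best fit fallback */
	assert(demo_find_free(ext, 4, 300000, 16384) == 1);	/* avoids the huge area */
	assert(demo_find_free(ext, 4, 0, 2097152) == -1);	/* nothing large enough */
}
```

Preferring the smallest sufficient extent in the fallback matches the stated goal of not breaking small chunks off of huge areas; a tree keyed by size would make that step logarithmic rather than linear.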
2008-09-23 21:14:11 +04:00
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, level and the tree in which
the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 18:45:14 +04:00
/* relocation.c */
2016-06-22 04:16:51 +03:00
int btrfs_relocate_block_group ( struct btrfs_fs_info * fs_info , u64 group_start ) ;
2009-06-10 18:45:14 +04:00
int btrfs_init_reloc_root ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ) ;
int btrfs_update_reloc_root ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ) ;
int btrfs_recover_relocation ( struct btrfs_root * root ) ;
int btrfs_reloc_clone_csums ( struct inode * inode , u64 file_pos , u64 len ) ;
2013-08-30 23:09:51 +04:00
int btrfs_reloc_cow_block ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root , struct extent_buffer * buf ,
struct extent_buffer * cow ) ;
2015-08-06 15:58:11 +03:00
void btrfs_reloc_pre_snapshot ( struct btrfs_pending_snapshot * pending ,
2010-05-16 18:49:59 +04:00
u64 * bytes_to_reserve ) ;
2012-03-01 20:24:58 +04:00
int btrfs_reloc_post_snapshot ( struct btrfs_trans_handle * trans ,
2010-05-16 18:49:59 +04:00
struct btrfs_pending_snapshot * pending ) ;
2011-03-08 16:14:00 +03:00
/* scrub.c */
2012-11-05 20:03:39 +04:00
int btrfs_scrub_dev ( struct btrfs_fs_info * fs_info , u64 devid , u64 start ,
u64 end , struct btrfs_scrub_progress * progress ,
2012-11-05 21:29:28 +04:00
int readonly , int is_dev_replace ) ;
2016-06-23 01:54:24 +03:00
void btrfs_scrub_pause ( struct btrfs_fs_info * fs_info ) ;
void btrfs_scrub_continue ( struct btrfs_fs_info * fs_info ) ;
2012-11-05 20:03:39 +04:00
int btrfs_scrub_cancel ( struct btrfs_fs_info * info ) ;
int btrfs_scrub_cancel_dev ( struct btrfs_fs_info * info ,
struct btrfs_device * dev ) ;
2016-06-23 01:54:24 +03:00
int btrfs_scrub_progress ( struct btrfs_fs_info * fs_info , u64 devid ,
2011-03-08 16:14:00 +03:00
struct btrfs_scrub_progress * progress ) ;
2017-04-14 03:35:54 +03:00
static inline void btrfs_init_full_stripe_locks_tree (
struct btrfs_full_stripe_locks_tree * locks_root )
{
locks_root - > root = RB_ROOT ;
mutex_init ( & locks_root - > lock ) ;
}
Btrfs: fix use-after-free in the finishing procedure of the device replace
During a device replace test, we hit a null pointer dereference (it was very easy
to reproduce by running xfstests' btrfs/011 on devices with the virtio
scsi driver). There were two bugs that caused this problem:
- We might allocate new chunks on the replaced device after we updated
the mapping tree, and we forgot to replace the source device in the
mappings of those new chunks.
- We might get mapping information that included the source device
before the mapping information was updated, and then submit a bio
based on that mapping information after we freed the source device.
For the first bug, we can fix it by doing the mapping tree update and the source
device removal in the same context of the chunk mutex. The chunk mutex is
used to protect the allocable device list, so the above method prevents
new chunk allocation, and after we remove the source device, all
the new chunks will be allocated on the new device. So it can fix
the first bug.
For the second bug, we need to make sure all in-flight bios are finished and
no new bios are produced while we are removing the source device. To fix
this problem, we introduced a global @bio_counter. We not only inc/dec
@bio_counter outside of map_blocks, but also inc it before submitting a bio
and dec @bio_counter when ending bios.
Since Raid56 is a little different and device replace doesn't support raid56
yet, it is not addressed in this patch, and I added comments to make sure we will
fix it in the future.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-01-30 12:46:55 +04:00
/* dev-replace.c */
void btrfs_bio_counter_inc_blocked ( struct btrfs_fs_info * fs_info ) ;
void btrfs_bio_counter_inc_noblocked ( struct btrfs_fs_info * fs_info ) ;
2014-11-25 11:39:28 +03:00
void btrfs_bio_counter_sub ( struct btrfs_fs_info * fs_info , s64 amount ) ;
static inline void btrfs_bio_counter_dec ( struct btrfs_fs_info * fs_info )
{
btrfs_bio_counter_sub ( fs_info , 1 ) ;
}
2011-03-08 16:14:00 +03:00
btrfs: initial readahead code and prototypes
This is the implementation for the generic read ahead framework.
To trigger a readahead, btrfs_reada_add must be called. It will start
a read ahead for the given range [start, end) on tree root. The returned
handle can either be used to wait on the readahead to finish
(btrfs_reada_wait), or to send it to the background (btrfs_reada_detach).
The read ahead works as follows:
On btrfs_reada_add, the root of the tree is inserted into a radix_tree.
reada_start_machine will then search for extents to prefetch and trigger
some reads. When a read finishes for a node, all contained node/leaf
pointers that lie in the given range will also be enqueued. The reads will
be triggered in sequential order, thus giving a big win over a naive
enumeration. It will also make use of multi-device layouts. Each disk
will have its own read pointer and all disks will be utilized in parallel.
Also, no two disks will read both sides of a mirror simultaneously, as this
would waste seeking capacity. Instead both disks will read different parts
of the filesystem.
Any number of readaheads can be started in parallel. The read order will be
determined globally, i.e. 2 parallel readaheads will normally finish faster
than the 2 started one after another.
Changes v2:
- protect root->node by transaction instead of node_lock
- fix missed branches:
The readahead had a too simple check to determine if a branch from
a node should be checked or not. It now also records the upper bound
of each node to see if the requested RA range lies within.
- use KERN_CONT to debug output, to avoid line breaks
- defer reada_start_machine to worker to avoid deadlock
Changes v3:
- protect root->node by rcu
Changes v5:
- changed EIO-semantics of reada_tree_block_flagged
- remove spin_lock from reada_control and make elems an atomic_t
- remove unused read_total from reada_control
- kill reada_key_cmp, use btrfs_comp_cpu_keys instead
- use kref-style release functions where possible
- return struct reada_control * instead of void * from btrfs_reada_add
Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-05-23 16:33:49 +04:00
/* reada.c */
struct reada_control {
2016-06-23 01:56:44 +03:00
struct btrfs_fs_info * fs_info ; /* tree to prefetch */
2011-05-23 16:33:49 +04:00
struct btrfs_key key_start ;
struct btrfs_key key_end ; /* exclusive */
atomic_t elems ;
struct kref refcnt ;
wait_queue_head_t wait ;
} ;
struct reada_control * btrfs_reada_add ( struct btrfs_root * root ,
struct btrfs_key * start , struct btrfs_key * end ) ;
int btrfs_reada_wait ( void * handle ) ;
void btrfs_reada_detach ( void * handle ) ;
2017-03-02 21:43:30 +03:00
int btree_readahead_hook ( struct extent_buffer * eb , int err ) ;
2011-05-23 16:33:49 +04:00
2012-05-29 19:06:54 +04:00
static inline int is_fstree ( u64 rootid )
{
if ( rootid = = BTRFS_FS_TREE_OBJECTID | |
2015-02-27 11:24:23 +03:00
( ( s64 ) rootid > = ( s64 ) BTRFS_FIRST_FREE_OBJECTID & &
! btrfs_qgroup_level ( rootid ) ) )
2012-05-29 19:06:54 +04:00
return 1 ;
return 0 ;
}
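The signed comparison in is_fstree() is what excludes the reserved tree IDs that sit at the top of the u64 range. A small userspace sketch follows; the `DEMO_*` constants mirror the values I believe btrfs uses (FS tree objectid 5, first free objectid 256, qgroup level stored in the top 16 bits) and should be treated as assumptions, since they are not defined in this section.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, mirroring the btrfs on-disk constants. */
#define DEMO_FS_TREE_OBJECTID		5ULL
#define DEMO_FIRST_FREE_OBJECTID	256ULL
#define DEMO_QGROUP_LEVEL_SHIFT		48

/* Stand-in for btrfs_qgroup_level(): the level lives in the upper bits. */
static uint64_t demo_qgroup_level(uint64_t id)
{
	return id >> DEMO_QGROUP_LEVEL_SHIFT;
}

static int demo_is_fstree(uint64_t rootid)
{
	if (rootid == DEMO_FS_TREE_OBJECTID ||
	    ((int64_t)rootid >= (int64_t)DEMO_FIRST_FREE_OBJECTID &&
	     !demo_qgroup_level(rootid)))
		return 1;
	return 0;
}

static void demo_is_fstree_selfcheck(void)
{
	assert(demo_is_fstree(5) == 1);			/* the default FS tree */
	assert(demo_is_fstree(256) == 1);		/* first subvolume id */
	assert(demo_is_fstree(1) == 0);			/* below the free range */
	assert(demo_is_fstree((uint64_t)-8) == 0);	/* negative when read as s64 */
	assert(demo_is_fstree((1ULL << 48) | 5) == 0);	/* level-1 qgroup id */
}
```

Casting to a signed type makes IDs such as (u64)-8 compare as negative, so they fail the lower bound without needing a separate upper-bound constant.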
2013-02-10 03:38:06 +04:00
static inline int btrfs_defrag_cancelled ( struct btrfs_fs_info * fs_info )
{
return signal_pending ( current ) ;
}
2013-10-11 22:44:09 +04:00
/* Sanity test specific functions */
# ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
void btrfs_test_destroy_inode ( struct inode * inode ) ;
# endif
2013-02-10 03:38:06 +04:00
2016-06-21 16:52:41 +03:00
static inline int btrfs_is_testing ( struct btrfs_fs_info * fs_info )
2014-09-30 01:53:21 +04:00
{
# ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
2016-06-21 16:52:41 +03:00
if ( unlikely ( test_bit ( BTRFS_FS_STATE_DUMMY_FS_INFO ,
& fs_info - > fs_state ) ) )
2014-09-30 01:53:21 +04:00
return 1 ;
# endif
return 0 ;
}
2007-02-02 17:18:22 +03:00
# endif