2018-04-03 19:16:55 +02:00
/* SPDX-License-Identifier: GPL-2.0 */
2007-06-12 09:07:21 -04:00
/*
* Copyright ( C ) 2007 Oracle . All rights reserved .
*/
2018-04-03 19:16:55 +02:00
# ifndef BTRFS_CTREE_H
# define BTRFS_CTREE_H
2007-02-02 09:18:22 -05:00
2007-10-15 16:18:55 -04:00
# include <linux/mm.h>
2017-02-02 19:15:33 +01:00
# include <linux/sched/signal.h>
2007-10-15 16:18:55 -04:00
# include <linux/highmem.h>
2007-03-22 12:13:20 -04:00
# include <linux/fs.h>
2011-03-08 14:14:00 +01:00
# include <linux/rwsem.h>
2013-08-15 17:11:21 +02:00
# include <linux/semaphore.h>
2007-08-29 15:47:34 -04:00
# include <linux/completion.h>
2008-03-26 10:28:07 -04:00
# include <linux/backing-dev.h>
2008-07-17 12:53:50 -04:00
# include <linux/wait.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers, which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
# include <linux/slab.h>
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and can be very
helpful for debugging, e.g.:
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordered_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are critical resources and produce a lot of corner cases during writeback,
so it is valuable to know how a page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when an inode is created, and when an inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In a transaction-based filesystem, it is useful to know the generation and
who does the commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references; these tracepoints are helpful for
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
A chunk is a link between a physical offset and a logical offset and represents space
information in btrfs, so these are helpful for tracing space usage.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-24 11:18:59 +00:00
# include <trace/events/btrfs.h>
2019-05-22 10:18:59 +02:00
# include <asm/unaligned.h>
2011-09-21 15:05:58 -04:00
# include <linux/pagemap.h>
2013-01-29 06:04:50 +00:00
# include <linux/btrfs.h>
2016-04-01 16:14:29 -04:00
# include <linux/btrfs_tree.h>
Btrfs: reclaim the reserved metadata space at background
Before applying this patch, a task had to reclaim the metadata space
by itself if there was not enough of it. When the task started
the space reclamation, all the other tasks that wanted to reserve
metadata space were blocked. In some cases they would be blocked for
a long time, which made performance fluctuate wildly.
So we introduce background metadata space reclamation: when the space
is about to be exhausted, we insert a reclaim work item into the workqueue, and
the workqueue's worker reclaims the reserved space in the
background. This way, the tasks needn't reclaim the space by themselves in
most cases, and even if the tasks have to reclaim the space or are blocked
for the space reclamation, they will get enough space more quickly.
Here is my test result (tested with compilebench):
Memory: 2GB
CPU: 2Cores * 1CPU
Partition: 40GB(SSD)
Test command:
# compilebench -D <mnt> -m
Without this patch:
intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
With this patch:
intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-05-13 17:29:04 -07:00
# include <linux/workqueue.h>
2014-09-23 13:40:08 +08:00
# include <linux/security.h>
2015-12-15 01:42:10 +09:00
# include <linux/sizes.h>
2016-08-31 23:55:33 -04:00
# include <linux/dynamic_debug.h>
2017-03-03 10:55:14 +02:00
# include <linux/refcount.h>
btrfs: Remove custom crc32c init code
The custom crc32 init code was introduced in
14a958e678cd ("Btrfs: fix btrfs boot when compiled as built-in") to
enable using btrfs as a built-in. However, later as pointed out by
60efa5eb2e88 ("Btrfs: use late_initcall instead of module_init") this
wasn't enough and finally btrfs was switched to late_initcall which
comes after the generic crc32c implementation is initialised. The
latter commit superseded the former. Now that we don't have to
maintain our own code let's just remove it and switch to using the
generic implementation.
Despite touching a lot of files the patch is really simple. Here is the gist of
the changes:
1. Select LIBCRC32C rather than the low-level modules.
2. s/btrfs_crc32c/crc32c/g
3. replace hash.h with linux/crc32c.h
4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly.
I've tested this with btrfs being both a module and a built-in and xfstest
doesn't complain.
This does seem to fix the longstanding problem of not automatically selecting
the crc32c module when btrfs is used. Possibly there is a workaround in
dracut.
The modinfo confirms that now all the module dependencies are there:
before:
depends: zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
after:
depends: libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add more info to changelog from mails ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-08 11:45:05 +02:00
# include <linux/crc32c.h>
2020-09-24 11:39:12 -05:00
# include <linux/iomap.h>
2019-09-23 10:05:19 -04:00
# include "extent-io-tree.h"
2008-01-24 16:13:08 -05:00
# include "extent_io.h"
2007-10-15 16:14:19 -04:00
# include "extent_map.h"
2008-06-11 16:50:36 -04:00
# include "async-thread.h"
2019-06-19 13:47:17 -04:00
# include "block-rsv.h"
2020-01-30 14:59:44 +02:00
# include "locking.h"
2007-03-22 12:13:20 -04:00
2007-03-16 16:20:31 -04:00
struct btrfs_trans_handle ;
2007-03-22 15:59:16 -04:00
struct btrfs_transaction ;
2010-05-16 10:48:46 -04:00
struct btrfs_pending_snapshot ;
2018-11-21 14:05:41 -05:00
struct btrfs_delayed_ref_root ;
2019-06-18 16:09:16 -04:00
struct btrfs_space_info ;
2019-10-29 19:20:18 +01:00
struct btrfs_block_group ;
2007-05-02 15:53:43 -04:00
extern struct kmem_cache * btrfs_trans_handle_cachep ;
extern struct kmem_cache * btrfs_bit_radix_cachep ;
2007-04-02 10:50:19 -04:00
extern struct kmem_cache * btrfs_path_cachep ;
2011-01-28 17:05:48 -05:00
extern struct kmem_cache * btrfs_free_space_cachep ;
btrfs: fix allocation of free space cache v1 bitmap pages
Various notifications of type "BUG kmalloc-4096 () : Redzone
overwritten" have been observed recently in various parts of the kernel.
After some time, they were found to be related to the use of a BTRFS
filesystem with SLUB_DEBUG turned on.
[ 22.809700] BUG kmalloc-4096 (Tainted: G W ): Redzone overwritten
[ 22.810286] INFO: 0xbe1a5921-0xfbfc06cd. First byte 0x0 instead of 0xcc
[ 22.810866] INFO: Allocated in __load_free_space_cache+0x588/0x780 [btrfs] age=22 cpu=0 pid=224
[ 22.811193] __slab_alloc.constprop.26+0x44/0x70
[ 22.811345] kmem_cache_alloc_trace+0xf0/0x2ec
[ 22.811588] __load_free_space_cache+0x588/0x780 [btrfs]
[ 22.811848] load_free_space_cache+0xf4/0x1b0 [btrfs]
[ 22.812090] cache_block_group+0x1d0/0x3d0 [btrfs]
[ 22.812321] find_free_extent+0x680/0x12a4 [btrfs]
[ 22.812549] btrfs_reserve_extent+0xec/0x220 [btrfs]
[ 22.812785] btrfs_alloc_tree_block+0x178/0x5f4 [btrfs]
[ 22.813032] __btrfs_cow_block+0x150/0x5d4 [btrfs]
[ 22.813262] btrfs_cow_block+0x194/0x298 [btrfs]
[ 22.813484] commit_cowonly_roots+0x44/0x294 [btrfs]
[ 22.813718] btrfs_commit_transaction+0x63c/0xc0c [btrfs]
[ 22.813973] close_ctree+0xf8/0x2a4 [btrfs]
[ 22.814107] generic_shutdown_super+0x80/0x110
[ 22.814250] kill_anon_super+0x18/0x30
[ 22.814437] btrfs_kill_super+0x18/0x90 [btrfs]
[ 22.814590] INFO: Freed in proc_cgroup_show+0xc0/0x248 age=41 cpu=0 pid=83
[ 22.814841] proc_cgroup_show+0xc0/0x248
[ 22.814967] proc_single_show+0x54/0x98
[ 22.815086] seq_read+0x278/0x45c
[ 22.815190] __vfs_read+0x28/0x17c
[ 22.815289] vfs_read+0xa8/0x14c
[ 22.815381] ksys_read+0x50/0x94
[ 22.815475] ret_from_syscall+0x0/0x38
Commit 69d2480456d1 ("btrfs: use copy_page for copying pages instead of
memcpy") changed the way bitmap blocks are copied. But although bitmaps
have the size of a page, they were allocated with kzalloc().
Most of the time, kzalloc() allocates aligned blocks of memory, so
copy_page() can be used. But when some debug options like SLAB_DEBUG are
activated, kzalloc() may return an unaligned pointer.
On powerpc, memcpy(), copy_page() and other copying functions use
'dcbz' instruction which provides an entire zeroed cacheline to avoid
memory read when the intention is to overwrite a full line. Functions
like memcpy() are written to care about partial cachelines at the start
and end of the destination, but copy_page() assumes it gets pages. As
pages are naturally cache aligned, copy_page() doesn't care about
partial lines. This means that when copy_page() is called with a
misaligned pointer, a few leading bytes are zeroed.
To fix it, allocate bitmaps through kmem_cache instead of using kzalloc().
The cache pool is created with a PAGE_SIZE alignment constraint.
Reported-by: Erhard F. <erhard_f@mailbox.org>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204371
Fixes: 69d2480456d1 ("btrfs: use copy_page for copying pages instead of memcpy")
Cc: stable@vger.kernel.org # 4.19+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to btrfs_free_space_bitmap ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-08-21 15:05:55 +00:00
extern struct kmem_cache * btrfs_free_space_bitmap_cachep ;
2008-07-17 12:53:50 -04:00
struct btrfs_ordered_sum ;
2019-04-04 14:45:35 +08:00
struct btrfs_ref ;
2007-03-16 16:20:31 -04:00
2013-02-20 00:55:13 +00:00
# define BTRFS_MAGIC 0x4D5F53665248425FULL /* ascii _BHRfS_M, no null */
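The comment on BTRFS_MAGIC says the constant spells the ASCII string `_BHRfS_M`. A minimal userspace sketch of that claim follows, assuming the value is read byte by byte starting from the least significant byte (little-endian order). The helper names `magic_to_ascii` and `check_magic` are hypothetical and are not part of the header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BTRFS_MAGIC 0x4D5F53665248425FULL /* value copied from the header */

/* Hypothetical helper: emit the 8 magic bytes, least significant first, plus a NUL. */
static void magic_to_ascii(uint64_t magic, char out[9])
{
	for (int i = 0; i < 8; i++)
		out[i] = (char)((magic >> (8 * i)) & 0xff);
	out[8] = '\0';
}

/* Hypothetical self-check: the bytes should spell the string named in the comment. */
static void check_magic(void)
{
	char buf[9];

	magic_to_ascii(BTRFS_MAGIC, buf);
	assert(strcmp(buf, "_BHRfS_M") == 0);
}
```

Reading the constant most significant byte first would give the reversed string, which is why the byte order assumption matters here.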
2007-02-02 09:18:22 -05:00
2019-05-31 17:22:59 +02:00
/*
* Maximum number of mirrors that can be available for all profiles counting
* the target device of dev-replace as one . During an active device replace
* procedure , the target device of the copy operation is a mirror for the
* filesystem data as well that can be used to read data in order to repair
* read errors on other disks .
*
2018-03-02 22:56:53 +01:00
* Current value is derived from RAID1C4 with 4 copies .
2019-05-31 17:22:59 +02:00
*/
2018-03-02 22:56:53 +01:00
# define BTRFS_MAX_MIRRORS (4 + 1)
2012-03-27 14:21:26 -04:00
2009-02-12 14:09:45 -05:00
# define BTRFS_MAX_LEVEL 8
2008-03-24 15:01:56 -04:00
2018-03-07 17:29:18 +08:00
# define BTRFS_OLDEST_GENERATION 0ULL
2007-03-22 12:13:20 -04:00
/*
* we can actually store much bigger names , but lets not confuse the rest
* of linux
*/
# define BTRFS_NAME_LEN 255
2012-08-08 11:32:27 -07:00
/*
* Theoretical limit is larger , but we keep this down to a sane
* value . That should limit greatly the possibility of collisions on
* inode ref items .
*/
# define BTRFS_LINK_MAX 65535U
2007-12-12 14:38:19 -05:00
# define BTRFS_EMPTY_DIR_SIZE 0
2007-03-29 15:15:27 -04:00
2012-02-03 11:20:04 +01:00
/* ioprio of readahead is set to idle */
# define BTRFS_IOPRIO_READA (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0))
2015-12-15 01:42:10 +09:00
# define BTRFS_DIRTY_METADATA_THRESH SZ_32M
2013-01-29 10:09:20 +00:00
btrfs: use customized batch size for total_bytes_pinned
In commit b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly
pinned bytes") we use total_bytes_pinned to track how many bytes we are
going to free in this transaction. When we are close to ENOSPC, we check it
to know whether we can make the allocation by committing the current transaction.
For every data/metadata extent we are going to free, we add to
total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and
release it in unpin_extent_range() when we finish the transaction. So this
is a variable we frequently update but rarely read, which is a suitable
use of percpu_counter. But in the previous commit we update total_bytes_pinned
with the default batch size of 32, making every update essentially a spin lock
protected update. Since every spin lock/unlock operation involves syncing
a globally used variable and some kind of barrier in an SMP system, this is
more expensive than using total_bytes_pinned as a simple atomic64_t.
So fix this by using a customized batch size. Since we only read
total_bytes_pinned when we are close to ENOSPC and fail to allocate a new
chunk, we can use a really large batch size and have nearly no penalty
in most cases.
[Test]
We tested the patch on a 4-cores x86 machine:
1. fallocate a 16GiB size test file
2. take snapshot (so all following writes will be COW)
3. run a 180 sec, 4 jobs, 4K random write fio on test file
We also added a temporary lockdep class on percpu_counter's spin lock
used by total_bytes_pinned to track it by lock_stat.
[Results]
unpatched:
lock_stat version 0.4
-----------------------------------------------------------------------
class name con-bounces contentions
waittime-min waittime-max waittime-total waittime-avg acq-bounces
acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
total_bytes_pinned_percpu: 82 82
0.21 0.61 29.46 0.36 298340
635973 0.09 11.01 173476.25 0.27
patched:
lock_stat version 0.4
-----------------------------------------------------------------------
class name con-bounces contentions
waittime-min waittime-max waittime-total waittime-avg acq-bounces
acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
total_bytes_pinned_percpu: 1 1
0.62 0.62 0.62 0.62 13601
31542 0.14 9.61 11016.90 0.35
[Analysis]
Since the spin lock only protects a single in-memory variable, the
contentions (number of lock acquisitions that had to wait) in both
unpatched and patched versions are low. But when we look at acquisitions and
acq-bounces, we get much lower counts in the patched version. Here the most
important metric is acq-bounces. It means how many times the lock gets
transferred between different cpus, so the patch can really reduce
cacheline bouncing of the spin lock (also the global counter of percpu_counter)
in an SMP system.
Fixes: b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly pinned bytes")
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-07-13 16:50:42 +08:00
/*
* Use large batch size to reduce overhead of metadata updates . On the reader
* side , we only read it when we are close to ENOSPC and the read overhead is
* mostly related to the number of CPUs , so it is OK to use arbitrary large
* value here .
*/
# define BTRFS_TOTAL_BYTES_PINNED_BATCH SZ_128M
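The effect of the batch size described above can be illustrated with a single-threaded model. This is only a sketch under stated assumptions: it is not the kernel's percpu_counter, it has no locking, and the names `demo_counter`, `demo_add_batch` and `demo_sum` are hypothetical. Each simulated CPU keeps a local delta and folds it into the shared total only when the delta reaches the batch size, so `folds` counts how often the shared (lock-protected, in real code) value would be touched.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_CPUS 4

/* Hypothetical model of a batched counter; no real per-CPU data or locks. */
struct demo_counter {
	int64_t global;              /* shared total */
	int64_t local[DEMO_NR_CPUS]; /* per-"CPU" deltas not yet folded in */
	unsigned long folds;         /* times the shared total was updated */
};

/* Add amount on one simulated CPU, folding into the shared total at the batch limit. */
static void demo_add_batch(struct demo_counter *c, int cpu, int64_t amount,
			   int64_t batch)
{
	int64_t count = c->local[cpu] + amount;

	if (count >= batch || count <= -batch) {
		c->global += count;
		c->local[cpu] = 0;
		c->folds++;
	} else {
		c->local[cpu] = count;
	}
}

/* Exact sum: the rarely used, more expensive read side. */
static int64_t demo_sum(const struct demo_counter *c)
{
	int64_t sum = c->global;

	for (int i = 0; i < DEMO_NR_CPUS; i++)
		sum += c->local[i];
	return sum;
}
```

With 4 KiB updates, a batch of 32 folds on every call, while a 128 MiB batch leaves the shared total untouched until a reader asks for the exact sum. Both give the same sum.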
2015-12-15 01:42:10 +09:00
# define BTRFS_MAX_EXTENT_SIZE SZ_128M
2015-02-11 15:08:59 -05:00
2019-12-13 16:22:20 -08:00
/*
* Deltas are an effective way to populate global statistics . Give macro names
* to make it clear what we're doing . An example is discard_extents in
* btrfs_free_space_ctl .
*/
# define BTRFS_STAT_NR_ENTRIES 2
# define BTRFS_STAT_CURR 0
# define BTRFS_STAT_PREV 1
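One way to read the CURR/PREV pair is as a two-slot array: the owner updates the current value locally and occasionally publishes only the difference since the last publication to a global total. The sketch below assumes that usage. The struct and function names (`demo_local_stat`, `demo_publish_delta`) are hypothetical stand-ins, not the fields of btrfs_free_space_ctl.

```c
#include <assert.h>
#include <stdint.h>

#define BTRFS_STAT_NR_ENTRIES 2
#define BTRFS_STAT_CURR 0
#define BTRFS_STAT_PREV 1

/* Hypothetical local statistic with a current and a last-published value. */
struct demo_local_stat {
	int32_t extents[BTRFS_STAT_NR_ENTRIES];
};

/* Publish only the change since the previous call, then remember what was published. */
static void demo_publish_delta(struct demo_local_stat *local, int64_t *global_total)
{
	int32_t delta = local->extents[BTRFS_STAT_CURR] -
			local->extents[BTRFS_STAT_PREV];

	if (delta) {
		*global_total += delta;
		local->extents[BTRFS_STAT_PREV] = local->extents[BTRFS_STAT_CURR];
	}
}
```

Publishing twice without a local change leaves the global total untouched, and a local decrease is propagated as a negative delta.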
2018-01-08 11:45:05 +02:00
2017-01-04 11:09:51 +01:00
/*
* Count how many BTRFS_MAX_EXTENT_SIZE cover the @size
*/
static inline u32 count_max_extents ( u64 size )
{
return div_u64 ( size + BTRFS_MAX_EXTENT_SIZE - 1 , BTRFS_MAX_EXTENT_SIZE ) ;
}
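count_max_extents() is a round-up division by BTRFS_MAX_EXTENT_SIZE. A userspace sketch of the same arithmetic follows, assuming plain 64-bit division in place of the kernel's div_u64() and a local definition of SZ_128M. `check_count_max_extents` is a hypothetical self-check added for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_128M (128ULL * 1024 * 1024) /* local stand-in for linux/sizes.h */
#define BTRFS_MAX_EXTENT_SIZE SZ_128M

/* Round-up division: how many maximum-sized extents are needed to cover size. */
static inline uint32_t count_max_extents(uint64_t size)
{
	return (uint32_t)((size + BTRFS_MAX_EXTENT_SIZE - 1) / BTRFS_MAX_EXTENT_SIZE);
}

/* Hypothetical self-check around the boundary values. */
static void check_count_max_extents(void)
{
	assert(count_max_extents(0) == 0);
	assert(count_max_extents(1) == 1);
	assert(count_max_extents(SZ_128M) == 1);
	assert(count_max_extents(SZ_128M + 1) == 2);
}
```

Adding `BTRFS_MAX_EXTENT_SIZE - 1` before dividing is what turns truncating division into a ceiling, so one byte past a 128 MiB boundary already counts as an extra extent.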
2008-03-24 15:01:56 -04:00
static inline unsigned long btrfs_chunk_item_size ( int num_stripes )
{
BUG_ON ( num_stripes == 0 ) ;
return sizeof ( struct btrfs_chunk ) +
sizeof ( struct btrfs_stripe ) * ( num_stripes - 1 ) ;
}
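The `num_stripes - 1` term suggests that the chunk structure already embeds its first stripe, so only the additional stripes are added on top of it. The sketch below models that layout with simplified stand-in structs. `demo_stripe`, `demo_chunk` and `demo_chunk_item_size` are hypothetical, their fields do not reproduce the on-disk definitions, and assert() stands in for BUG_ON().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; not the real on-disk layout. */
struct demo_stripe {
	uint64_t devid;
	uint64_t offset;
	uint8_t dev_uuid[16];
};

struct demo_chunk {
	uint64_t length;
	uint16_t num_stripes;
	struct demo_stripe stripe; /* first stripe is embedded in the chunk */
};

/* Size of a chunk item: the chunk itself plus every stripe after the first. */
static size_t demo_chunk_item_size(int num_stripes)
{
	assert(num_stripes != 0);
	return sizeof(struct demo_chunk) +
	       sizeof(struct demo_stripe) * (num_stripes - 1);
}
```

For a single stripe the result is just the size of the chunk structure, and each extra stripe adds exactly one stripe structure.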
2011-01-06 19:30:25 +08:00
/*
2018-11-27 14:50:27 +01:00
* Runtime ( in-memory ) states of filesystem
2011-01-06 19:30:25 +08:00
*/
2018-11-27 14:50:27 +01:00
enum {
/* Global indicator of serious filesystem errors */
BTRFS_FS_STATE_ERROR ,
/*
* Filesystem is being remounted , allow to skip some operations , like
* defrag
*/
BTRFS_FS_STATE_REMOUNTING ,
btrfs: fix race between RO remount and the cleaner task
When we are remounting a filesystem in RO mode we can race with the cleaner
task and result in leaking a transaction if the filesystem is unmounted
shortly after, before the transaction kthread had a chance to commit that
transaction. That also results in a crash during unmount, due to a
use-after-free, if hardware acceleration is not available for crc32c.
The following sequence of steps explains how the race happens.
1) The filesystem is mounted in RW mode and the cleaner task is running.
This means that currently BTRFS_FS_CLEANER_RUNNING is set at
fs_info->flags;
2) The cleaner task is currently running delayed iputs for example;
3) A filesystem RO remount operation starts;
4) The RO remount task calls btrfs_commit_super(), which commits any
currently open transaction, and it finishes;
5) At this point the cleaner task is still running and it creates a new
transaction by doing one of the following things:
* When running the delayed iput() for an inode with a 0 link count,
in which case at btrfs_evict_inode() we start a transaction through
the call to evict_refill_and_join(), use it and then release its
handle through btrfs_end_transaction();
* When deleting a dead root through btrfs_clean_one_deleted_snapshot(),
a transaction is started at btrfs_drop_snapshot() and then its handle
is released through a call to btrfs_end_transaction_throttle();
* When the remount task was still running, and before the remount task
called btrfs_delete_unused_bgs(), the cleaner task also called
btrfs_delete_unused_bgs() and it picked and removed one block group
from the list of unused block groups. Before the cleaner task started
a transaction, through btrfs_start_trans_remove_block_group() at
btrfs_delete_unused_bgs(), the remount task had already called
btrfs_commit_super();
6) So at this point the filesystem is in RO mode and we have an open
transaction that was started by the cleaner task;
7) Shortly after a filesystem unmount operation starts. At close_ctree()
we stop the transaction kthread before it had a chance to commit the
transaction, since less than 30 seconds (the default commit interval)
have elapsed since the last transaction was committed;
8) We end up calling iput() against the btree inode at close_ctree() while
there is an open transaction, and since that transaction was used to
update btrees by the cleaner, we have dirty pages in the btree inode
due to COW operations on metadata extents, and therefore writeback is
triggered for the btree inode.
So btree_write_cache_pages() is invoked to flush those dirty pages
during the final iput() on the btree inode. This results in creating a
bio and submitting it, which makes us end up at
btrfs_submit_metadata_bio();
9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
that calls btrfs_wq_submit_bio(), because check_async_write() returned
a value of 1. This value of 1 is because we did not have hardware
acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
set in fs_info->flags;
10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
workqueue at fs_info->workers, which was already freed before by the
call to btrfs_stop_all_workers() at close_ctree(). This results in an
invalid memory access due to a use-after-free, leading to a crash.
When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:
------------[ cut here ]------------
WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
Code: f0 01 00 00 48 39 c2 75 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 48 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c6 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Code: 48 83 bb b0 03 00 00 00 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c7 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Code: ad de 49 be 22 01 00 (...)
RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c8 ]---
BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:
stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
Code: 54 55 53 48 89 f3 (...)
RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
submit_one_bio+0x61/0x70 [btrfs]
btree_write_cache_pages+0x414/0x450 [btrfs]
? kobject_put+0x9a/0x1d0
? trace_hardirqs_on+0x1b/0xf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
? free_debug_processing+0x1e1/0x2b0
do_writepages+0x43/0xe0
? lock_acquired+0x199/0x490
__writeback_single_inode+0x59/0x650
writeback_single_inode+0xaf/0x120
write_inode_now+0x94/0xd0
iput+0x187/0x2b0
close_ctree+0x2c6/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3cfebabee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
---[ end trace dd74718fef1ed5cc ]---
Finally when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, a consequence of the transaction that was never committed and got
leaked:
=============================================================================
BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x0000000050cbdd61 @offset=12104
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x0000000086e9b0ff @offset=12776
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
commit_cowonly_roots+0x248/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000001a340018 @offset=4408
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_commit_transaction+0x60/0xc40 [btrfs]
create_subvol+0x56a/0x990 [btrfs]
btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
btrfs_ioctl+0x1a92/0x36f0 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x000000002b46292a @offset=13648
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? __mutex_unlock_slowpath+0x45/0x2a0
kmem_cache_destroy+0x55/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000004cf95ea8 @offset=6264
INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1
So fix this by making the remount path wait for the cleaner task before
calling btrfs_commit_super(). The remount path now waits for the bit
BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling
btrfs_commit_super() and this ensures the cleaner can not start a
transaction after that, because it sleeps when the filesystem is in RO
mode and we have already flagged the filesystem as RO before waiting for
BTRFS_FS_CLEANER_RUNNING to be cleared.
This also introduces a new flag BTRFS_FS_STATE_RO to be used for
fs_info->fs_state when the filesystem is in RO mode. This is because we
were doing the RO check using the flags of the superblock and setting the
RO mode simply by ORing into the superblock's flags - those operations are
not atomic and could result in the cleaner not seeing the update from the
remount task after it clears BTRFS_FS_CLEANER_RUNNING.
Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-14 10:10:47 +00:00
/* Filesystem in RO mode */
BTRFS_FS_STATE_RO ,
2018-11-27 14:50:27 +01:00
/* Track if a transaction abort has been reported on this filesystem */
BTRFS_FS_STATE_TRANS_ABORTED ,
/*
* Bio operations should be blocked on this filesystem because a source
* or target device is being destroyed as part of a device replace
*/
BTRFS_FS_STATE_DEV_REPLACING ,
/* The btrfs_fs_info created for self-tests */
BTRFS_FS_STATE_DUMMY_FS_INFO ,
} ;
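A minimal userspace sketch of why the RO state is tracked as a bit in fs_info->fs_state rather than by ORing into the superblock flags: the update and the check each need to be a single atomic operation. This is not the kernel code. It swaps in C11 <stdatomic.h> for the kernel's bit helpers, and every demo_* name and the bit position are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit number; the real enum position is not assumed here. */
enum demo_fs_state {
	DEMO_FS_STATE_RO = 0,
};

/* Hypothetical stand-in for the fs_state word in the fs_info structure. */
struct demo_fs_info {
	atomic_ulong fs_state;
};

/* Set the RO bit with one atomic read-modify-write. */
static void demo_set_ro(struct demo_fs_info *info)
{
	atomic_fetch_or(&info->fs_state, 1UL << DEMO_FS_STATE_RO);
}

/* Clear the RO bit with one atomic read-modify-write. */
static void demo_clear_ro(struct demo_fs_info *info)
{
	atomic_fetch_and(&info->fs_state, ~(1UL << DEMO_FS_STATE_RO));
}

/* Atomic load, so a concurrent reader sees either the old or the new state. */
static bool demo_is_ro(struct demo_fs_info *info)
{
	return (atomic_load(&info->fs_state) & (1UL << DEMO_FS_STATE_RO)) != 0;
}
```

A plain `flags |= RO` compiles to a separate load, OR and store, so a concurrent writer to the same word can lose an update. The atomic read-modify-write above avoids that.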
2011-01-06 19:30:25 +08:00
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
# define BTRFS_BACKREF_REV_MAX 256
# define BTRFS_BACKREF_REV_SHIFT 56
# define BTRFS_BACKREF_REV_MASK (((u64)BTRFS_BACKREF_REV_MAX - 1) << \
BTRFS_BACKREF_REV_SHIFT )
# define BTRFS_OLD_BACKREF_REV 0
# define BTRFS_MIXED_BACKREF_REV 1
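A small sketch of how the macros above carve a revision number out of the top byte of a 64-bit flags word. The BTRFS_BACKREF_* definitions are copied from the lines above (with uint64_t in place of u64). The demo_* helpers are hypothetical and only illustrate the mask and shift arithmetic.

```c
#include <assert.h>
#include <stdint.h>

#define BTRFS_BACKREF_REV_MAX		256
#define BTRFS_BACKREF_REV_SHIFT		56
#define BTRFS_BACKREF_REV_MASK		(((uint64_t)BTRFS_BACKREF_REV_MAX - 1) << \
					 BTRFS_BACKREF_REV_SHIFT)
#define BTRFS_OLD_BACKREF_REV		0
#define BTRFS_MIXED_BACKREF_REV		1

/* Hypothetical helper: read the revision stored in bits 56..63. */
static uint64_t demo_get_backref_rev(uint64_t flags)
{
	return (flags & BTRFS_BACKREF_REV_MASK) >> BTRFS_BACKREF_REV_SHIFT;
}

/* Hypothetical helper: replace the revision, leaving the low 56 bits alone. */
static uint64_t demo_set_backref_rev(uint64_t flags, uint64_t rev)
{
	flags &= ~BTRFS_BACKREF_REV_MASK;
	return flags | (rev << BTRFS_BACKREF_REV_SHIFT);
}
```

With a mask of (256 - 1) << 56, the revision occupies exactly one byte, which is why BTRFS_BACKREF_REV_MAX is 256.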
2008-04-01 11:21:32 -04:00
2007-02-26 10:40:21 -05:00
/*
* every tree block (leaf or node) starts with this header.
*/
2007-03-12 12:29:44 -04:00
struct btrfs_header {
2008-04-15 15:41:47 -04:00
/* these first four must match the super block */
2007-03-29 15:15:27 -04:00
u8 csum [ BTRFS_CSUM_SIZE ] ;
2007-10-15 16:14:19 -04:00
u8 fsid [ BTRFS_FSID_SIZE ] ; /* FS specific uuid */
2007-10-15 16:15:53 -04:00
__le64 bytenr ; /* which block this node is supposed to live in */
2008-04-01 11:21:32 -04:00
__le64 flags ;
2008-04-15 15:41:47 -04:00
/* allowed to be different from the super from here on down */
u8 chunk_tree_uuid [ BTRFS_UUID_SIZE ] ;
2007-03-23 15:56:19 -04:00
__le64 generation ;
2007-04-20 20:23:12 -04:00
__le64 owner ;
2007-10-15 16:14:19 -04:00
__le32 nritems ;
2007-03-27 09:06:38 -04:00
u8 level ;
2007-02-02 09:18:22 -05:00
} __attribute__ ( ( __packed__ ) ) ;
2008-03-24 15:01:56 -04:00
/*
* this is a very generous portion of the super block, giving us
* room to translate 14 chunks with 3 stripes each.
*/
# define BTRFS_SYSTEM_CHUNK_ARRAY_SIZE 2048
2011-11-03 15:17:42 -04:00
/*
* just in case we somehow lose the roots and are not able to mount,
* we store an array of the roots from previous transactions
* in the super.
*/
# define BTRFS_NUM_BACKUP_ROOTS 4
struct btrfs_root_backup {
__le64 tree_root ;
__le64 tree_root_gen ;
__le64 chunk_root ;
__le64 chunk_root_gen ;
__le64 extent_root ;
__le64 extent_root_gen ;
__le64 fs_root ;
__le64 fs_root_gen ;
__le64 dev_root ;
__le64 dev_root_gen ;
__le64 csum_root ;
__le64 csum_root_gen ;
__le64 total_bytes ;
__le64 bytes_used ;
__le64 num_devices ;
/* future */
2012-10-31 15:16:32 +00:00
__le64 unused_64 [ 4 ] ;
2011-11-03 15:17:42 -04:00
u8 tree_root_level ;
u8 chunk_root_level ;
u8 extent_root_level ;
u8 fs_root_level ;
u8 dev_root_level ;
u8 csum_root_level ;
/* future and to align */
u8 unused_8 [ 10 ] ;
} __attribute__ ( ( __packed__ ) ) ;
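The size arithmetic of the packed structure above can be checked with a compile-time assertion: 15 __le64 fields (120 bytes), 4 unused __le64 (32 bytes), 6 level bytes and 10 padding bytes add up to 168. The sketch below is a userspace mirror, not the kernel definition. The demo_* names are hypothetical, uint64_t stands in for __le64, and it assumes a GCC or Clang compatible compiler for the packed attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_BACKUP_ROOTS 4

/* Userspace mirror of the layout above; field order matches the source. */
struct demo_root_backup {
	uint64_t tree_root;
	uint64_t tree_root_gen;
	uint64_t chunk_root;
	uint64_t chunk_root_gen;
	uint64_t extent_root;
	uint64_t extent_root_gen;
	uint64_t fs_root;
	uint64_t fs_root_gen;
	uint64_t dev_root;
	uint64_t dev_root_gen;
	uint64_t csum_root;
	uint64_t csum_root_gen;
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t num_devices;
	uint64_t unused_64[4];
	uint8_t tree_root_level;
	uint8_t chunk_root_level;
	uint8_t extent_root_level;
	uint8_t fs_root_level;
	uint8_t dev_root_level;
	uint8_t csum_root_level;
	uint8_t unused_8[10];
} __attribute__((__packed__));

/* 15 * 8 + 4 * 8 + 6 + 10 = 168 bytes, a multiple of 8. */
static_assert(sizeof(struct demo_root_backup) == 168,
	      "unexpected root backup size");

/* Space taken by the whole backup array in the super block. */
static size_t demo_backup_array_size(void)
{
	return DEMO_NUM_BACKUP_ROOTS * sizeof(struct demo_root_backup);
}
```

The 10 trailing bytes are what bring the 6 level bytes up to a multiple of 8, matching the "future and to align" comment.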
2007-02-26 10:40:21 -05:00
/*
* the super block basically lists the main trees of the FS
* it currently lacks any block count etc etc
*/
2007-03-13 10:46:10 -04:00
struct btrfs_super_block {
2008-04-01 11:21:32 -04:00
/* the first 4 fields must match struct btrfs_header */
2018-10-30 16:43:23 +02:00
u8 csum [ BTRFS_CSUM_SIZE ] ;
/* FS specific UUID, visible to user */
u8 fsid [ BTRFS_FSID_SIZE ] ;
2007-10-15 16:15:53 -04:00
__le64 bytenr ; /* this block number */
2008-04-01 11:21:32 -04:00
__le64 flags ;
2008-04-15 15:41:47 -04:00
/* allowed to be different from the btrfs_header from here on down */
2007-03-13 16:47:54 -04:00
__le64 magic ;
__le64 generation ;
__le64 root ;
2008-03-24 15:01:56 -04:00
__le64 chunk_root ;
2008-09-05 16:13:11 -04:00
__le64 log_root ;
2008-12-08 16:40:21 -05:00
/* this will help find the new super based on the log root */
__le64 log_root_transid ;
2007-10-15 16:15:53 -04:00
__le64 total_bytes ;
__le64 bytes_used ;
2007-03-21 11:12:56 -04:00
__le64 root_dir_objectid ;
2008-03-24 15:02:07 -04:00
__le64 num_devices ;
2007-10-15 16:14:19 -04:00
__le32 sectorsize ;
__le32 nodesize ;
2014-06-04 19:22:26 +02:00
__le32 __unused_leafsize ;
2007-11-30 11:30:34 -05:00
__le32 stripesize ;
2008-03-24 15:01:56 -04:00
__le32 sys_chunk_array_size ;
2008-10-29 14:49:05 -04:00
__le64 chunk_root_generation ;
2008-12-02 06:36:08 -05:00
__le64 compat_flags ;
__le64 compat_ro_flags ;
__le64 incompat_flags ;
2008-12-02 07:17:45 -05:00
__le16 csum_type ;
2007-10-15 16:15:53 -04:00
u8 root_level ;
2008-03-24 15:01:56 -04:00
u8 chunk_root_level ;
2008-09-05 16:13:11 -04:00
u8 log_root_level ;
2008-03-24 15:02:07 -04:00
struct btrfs_dev_item dev_item ;
2008-12-08 16:40:21 -05:00
2008-04-18 10:29:49 -04:00
char label [ BTRFS_LABEL_SIZE ] ;
2008-12-08 16:40:21 -05:00
2010-06-21 14:48:16 -04:00
__le64 cache_generation ;
2013-08-15 17:11:22 +02:00
__le64 uuid_tree_generation ;
2010-06-21 14:48:16 -04:00
2018-10-30 16:43:23 +02:00
/* the UUID written into btree blocks */
u8 metadata_uuid [ BTRFS_FSID_SIZE ] ;
2008-12-08 16:40:21 -05:00
/* future expansion */
2018-10-30 16:43:23 +02:00
__le64 reserved [ 28 ] ;
2008-03-24 15:01:56 -04:00
u8 sys_chunk_array [ BTRFS_SYSTEM_CHUNK_ARRAY_SIZE ] ;
2011-11-03 15:17:42 -04:00
struct btrfs_root_backup super_roots [ BTRFS_NUM_BACKUP_ROOTS ] ;
2007-02-21 17:04:57 -05:00
} __attribute__ ( ( __packed__ ) ) ;
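The requirement that the first four fields match struct btrfs_header can be expressed as compile-time offset checks. The sketch below uses truncated, hypothetical demo_* mirrors of both structures. The sizes 32 and 16 for the checksum and fsid arrays are assumptions, since BTRFS_CSUM_SIZE and BTRFS_FSID_SIZE are not defined in this section, and the packed attribute assumes a GCC or Clang compatible compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CSUM_SIZE 32	/* assumed value of BTRFS_CSUM_SIZE */
#define DEMO_FSID_SIZE 16	/* assumed value of BTRFS_FSID_SIZE */

/* Truncated mirror of the tree block header: shared prefix plus one field. */
struct demo_header_prefix {
	uint8_t csum[DEMO_CSUM_SIZE];
	uint8_t fsid[DEMO_FSID_SIZE];
	uint64_t bytenr;
	uint64_t flags;
	uint8_t chunk_tree_uuid[16];	/* first field allowed to differ */
} __attribute__((__packed__));

/* Truncated mirror of the super block: shared prefix plus one field. */
struct demo_super_prefix {
	uint8_t csum[DEMO_CSUM_SIZE];
	uint8_t fsid[DEMO_FSID_SIZE];
	uint64_t bytenr;
	uint64_t flags;
	uint64_t magic;			/* first field allowed to differ */
} __attribute__((__packed__));

/* Fail the build if a shared field sits at a different offset. */
#define DEMO_SAME_OFFSET(field)						\
	static_assert(offsetof(struct demo_header_prefix, field) ==	\
		      offsetof(struct demo_super_prefix, field),	\
		      #field " offset differs")

DEMO_SAME_OFFSET(csum);
DEMO_SAME_OFFSET(fsid);
DEMO_SAME_OFFSET(bytenr);
DEMO_SAME_OFFSET(flags);

/* Number of bytes both structures have in common at the start. */
static size_t demo_shared_prefix_size(void)
{
	return offsetof(struct demo_super_prefix, magic);
}
```

Because the prefix is identical, code that only needs csum, fsid, bytenr or flags can treat a super block and a tree block header the same way.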
2008-12-02 06:36:08 -05:00
/*
* Compat flags that we support. If any incompat flags are set other than the
* ones specified below then we will fail to mount
*/
2009-06-10 10:45:14 -04:00
# define BTRFS_FEATURE_COMPAT_SUPP 0ULL
2013-11-15 15:33:55 -05:00
# define BTRFS_FEATURE_COMPAT_SAFE_SET 0ULL
# define BTRFS_FEATURE_COMPAT_SAFE_CLEAR 0ULL
2015-09-29 20:50:38 -07:00
# define BTRFS_FEATURE_COMPAT_RO_SUPP \
2016-09-22 17:24:22 -07:00
( BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE | \
BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID )
2015-09-29 20:50:38 -07:00
2013-11-15 15:33:55 -05:00
# define BTRFS_FEATURE_COMPAT_RO_SAFE_SET 0ULL
# define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR 0ULL
2010-06-21 14:48:16 -04:00
# define BTRFS_FEATURE_INCOMPAT_SUPP \
( BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF | \
2010-09-16 16:19:09 -04:00
BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL | \
2010-10-25 15:12:26 +08:00
BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS | \
2010-08-06 13:21:20 -04:00
BTRFS_FEATURE_INCOMPAT_BIG_METADATA | \
2012-08-08 11:32:27 -07:00
BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO | \
btrfs: Add zstd support
Add zstd compression and decompression support to BtrFS. zstd at its
fastest level compresses almost as well as zlib, while offering much
faster compression and decompression, approaching lzo speeds.
I benchmarked btrfs with zstd compression against no compression, lzo
compression, and zlib compression. I benchmarked two scenarios. Copying
a set of files to btrfs, and then reading the files. Copying a tarball
to btrfs, extracting it to btrfs, and then reading the extracted files.
After every operation, I call `sync` and include the sync time.
Between every pair of operations I unmount and remount the filesystem
to avoid caching. The benchmark files can be found in the upstream
zstd source repository under
`contrib/linux-kernel/{btrfs-benchmark.sh,btrfs-extract-benchmark.sh}`
[1] [2].
I ran the benchmarks on an Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and an SSD.
The first compression benchmark is copying 10 copies of the unzipped
Silesia corpus [3] into a BtrFS filesystem mounted with
`-o compress-force=Method`. The decompression benchmark times how long
it takes to `tar` all 10 copies into `/dev/null`. The compression ratio is
measured by comparing the output of `df` and `du`. See the benchmark file
[1] for details. I benchmarked multiple zstd compression levels, although
the patch uses zstd level 1.
| Method | Ratio | Compression MB/s | Decompression MB/s |
|---------|-------|------------------|--------------------|
| None | 0.99 | 504 | 686 |
| lzo | 1.66 | 398 | 442 |
| zlib | 2.58 | 65 | 241 |
| zstd 1 | 2.57 | 260 | 383 |
| zstd 3 | 2.71 | 174 | 408 |
| zstd 6 | 2.87 | 70 | 398 |
| zstd 9 | 2.92 | 43 | 406 |
| zstd 12 | 2.93 | 21 | 408 |
| zstd 15 | 3.01 | 11 | 354 |
The next benchmark first copies `linux-4.11.6.tar` [4] to btrfs. Then it
measures the compression ratio, extracts the tar, and deletes the tar.
Then it measures the compression ratio again, and `tar`s the extracted
files into `/dev/null`. See the benchmark file [2] for details.
| Method | Tar Ratio | Extract Ratio | Copy (s) | Extract (s)| Read (s) |
|--------|-----------|---------------|----------|------------|----------|
| None | 0.97 | 0.78 | 0.981 | 5.501 | 8.807 |
| lzo | 2.06 | 1.38 | 1.631 | 8.458 | 8.585 |
| zlib | 3.40 | 1.86 | 7.750 | 21.544 | 11.744 |
| zstd 1 | 3.57 | 1.85 | 2.579 | 11.479 | 9.389 |
[1] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-benchmark.sh
[2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-extract-benchmark.sh
[3] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
[4] https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.11.6.tar.xz
zstd source repository: https://github.com/facebook/zstd
Signed-off-by: Nick Terrell <terrelln@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-08-09 19:39:02 -07:00
BTRFS_FEATURE_INCOMPAT_COMPRESS_ZSTD | \
2013-01-29 18:40:14 -05:00
BTRFS_FEATURE_INCOMPAT_RAID56 | \
2013-03-07 14:22:04 -05:00
BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF | \
2013-10-22 12:18:51 -04:00
BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA | \
2018-10-30 16:43:23 +02:00
BTRFS_FEATURE_INCOMPAT_NO_HOLES | \
2018-07-10 18:15:05 +02:00
BTRFS_FEATURE_INCOMPAT_METADATA_UUID | \
2021-02-04 19:22:21 +09:00
BTRFS_FEATURE_INCOMPAT_RAID1C34 | \
BTRFS_FEATURE_INCOMPAT_ZONED )
2008-12-02 06:36:08 -05:00
2013-11-15 15:33:55 -05:00
# define BTRFS_FEATURE_INCOMPAT_SAFE_SET \
( BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF )
# define BTRFS_FEATURE_INCOMPAT_SAFE_CLEAR 0ULL
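A minimal sketch of how a supported-feature mask like the ones above is typically consumed: any on-disk incompat bit that falls outside the mask means the filesystem must not be mounted. The bit positions and the `ex_` names below are hypothetical stand-ins chosen for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit assignments, for illustration only. */
#define EX_INCOMPAT_MIXED_BACKREF (1ULL << 0)
#define EX_INCOMPAT_COMPRESS_ZSTD (1ULL << 4)
#define EX_INCOMPAT_SUPP \
	(EX_INCOMPAT_MIXED_BACKREF | EX_INCOMPAT_COMPRESS_ZSTD)

/* Return the on-disk flags that are NOT covered by the supported mask. */
static uint64_t ex_unsupported_incompat(uint64_t disk_flags)
{
	return disk_flags & ~EX_INCOMPAT_SUPP;
}

/* Usage example: a non-zero result means "refuse to mount". */
static void ex_feature_selftest(void)
{
	assert(ex_unsupported_incompat(EX_INCOMPAT_MIXED_BACKREF) == 0);
	assert(ex_unsupported_incompat(1ULL << 9) == (1ULL << 9));
}
```

The same masking pattern would apply to the compat_ro mask, except that an unknown compat_ro bit would normally still allow a read-only mount.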
2008-12-02 06:36:08 -05:00
2007-02-26 10:40:21 -05:00
/*
2007-03-15 12:56:47 -04:00
* A leaf is full of items . offset and size tell us where to find
2007-02-26 10:40:21 -05:00
* the item in the leaf ( relative to the start of the data area )
*/
2007-03-12 20:12:07 -04:00
struct btrfs_item {
2007-03-12 16:22:34 -04:00
struct btrfs_disk_key key ;
2007-03-14 14:14:43 -04:00
__le32 offset ;
2007-10-15 16:14:19 -04:00
__le32 size ;
2007-02-02 09:18:22 -05:00
} __attribute__ ( ( __packed__ ) ) ;
2007-02-26 10:40:21 -05:00
/*
* leaves have an item area and a data area :
* [ item0 , item1 . . . . itemN ] [ free space ] [ dataN . . . data1 , data0 ]
*
* The data is separate from the items to get the keys closer together
* during searches .
*/
2007-03-13 10:46:10 -04:00
struct btrfs_leaf {
2007-03-12 12:29:44 -04:00
struct btrfs_header header ;
2007-03-14 14:14:43 -04:00
struct btrfs_item items [ ] ;
2007-02-02 09:18:22 -05:00
} __attribute__ ( ( __packed__ ) ) ;
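The leaf layout described above can be sketched as follows: item headers grow forward from the start of the leaf's data area and item data grows backward from its end, so the free space is whatever remains between the two. The `ex_` types are simplified host-endian stand-ins (the on-disk fields are little-endian), and `ex_leaf_free_space` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the packed on-disk structures. */
struct ex_disk_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
} __attribute__((__packed__));

struct ex_item {
	struct ex_disk_key key;
	uint32_t offset;	/* where the item's data starts in the data area */
	uint32_t size;		/* length of the item's data */
} __attribute__((__packed__));

/* Free space = data area - item headers - item data. */
static long ex_leaf_free_space(size_t data_area,
			       const struct ex_item *items, unsigned int nritems)
{
	size_t used = (size_t)nritems * sizeof(struct ex_item);

	for (unsigned int i = 0; i < nritems; i++)
		used += items[i].size;
	return (long)data_area - (long)used;
}

/* Usage example with two items of 100 and 50 data bytes. */
static void ex_leaf_selftest(void)
{
	struct ex_item items[2] = { { .size = 100 }, { .size = 50 } };

	assert(ex_leaf_free_space(1000, items, 2) == 1000 - 2 * 25 - 150);
}
```

Keeping the headers contiguous at the front is what lets a key search touch only the header array, as the comment above notes.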
2007-02-26 10:40:21 -05:00
/*
* all non - leaf blocks are nodes , they hold only keys and pointers to
* other blocks
*/
2007-03-14 14:14:43 -04:00
struct btrfs_key_ptr {
struct btrfs_disk_key key ;
__le64 blockptr ;
2007-12-11 09:25:06 -05:00
__le64 generation ;
2007-03-14 14:14:43 -04:00
} __attribute__ ( ( __packed__ ) ) ;
2007-03-13 10:46:10 -04:00
struct btrfs_node {
2007-03-12 12:29:44 -04:00
struct btrfs_header header ;
2007-03-14 14:14:43 -04:00
struct btrfs_key_ptr ptrs [ ] ;
2007-02-02 09:18:22 -05:00
} __attribute__ ( ( __packed__ ) ) ;
2007-02-26 10:40:21 -05:00
/*
2007-03-13 10:46:10 -04:00
* btrfs_paths remember the path taken from the root down to the leaf .
* level 0 is always the leaf , and nodes [ 1. . . BTRFS_MAX_LEVEL ] will point
2007-02-26 10:40:21 -05:00
* to any other levels that are present .
*
* The slots array records the index of the item or block pointer
* used while walking the tree .
*/
2018-11-27 15:25:13 +01:00
enum { READA_NONE , READA_BACK , READA_FORWARD } ;
2007-03-13 10:46:10 -04:00
struct btrfs_path {
2007-10-15 16:14:19 -04:00
struct extent_buffer * nodes [ BTRFS_MAX_LEVEL ] ;
2007-03-13 10:46:10 -04:00
int slots [ BTRFS_MAX_LEVEL ] ;
2008-06-25 16:01:30 -04:00
/* if there is real range locking, this locks field will change */
2015-11-27 16:31:45 +01:00
u8 locks [ BTRFS_MAX_LEVEL ] ;
2015-11-27 16:31:38 +01:00
u8 reada ;
2008-06-25 16:01:30 -04:00
/* keep some upper locks as we walk down */
2015-11-27 16:31:42 +01:00
u8 lowest_level ;
2008-12-10 09:10:46 -05:00
/*
* set by btrfs_split_item , tells search_slot to keep all locks
* and to force calls to keep space in the nodes
*/
2009-03-13 11:00:37 -04:00
unsigned int search_for_split : 1 ;
unsigned int keep_locks : 1 ;
unsigned int skip_locking : 1 ;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, level, and the
tree in which the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
unsigned int search_commit_root : 1 ;
2014-03-28 17:16:01 -04:00
unsigned int need_commit_sem : 1 ;
2014-11-09 08:38:39 +00:00
unsigned int skip_release_on_error : 1 ;
btrfs: correctly calculate item size used when item key collision happens
Item key collision is allowed for some item types, like dir item and
inode refs, but the overall item size is limited by the nodesize.
item size(ins_len) passed from btrfs_insert_empty_items to
btrfs_search_slot already contains size of btrfs_item.
When btrfs_search_slot reaches leaf, we'll see if we need to split leaf.
The check incorrectly reports that split leaf is required, because
it treats the space required by the newly inserted item as
btrfs_item + item data. But in item key collision case, only item data
is actually needed, the newly inserted item could merge into the existing
one. No new btrfs_item will be inserted.
split_leaf then returns -EOVERFLOW from the following code:
if (extend && data_size + btrfs_item_size_nr(l, slot) +
sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info))
return -EOVERFLOW;
In most cases, when callers receive EOVERFLOW, they either return
this error or handle in different ways. For example, in normal dir item
creation the userspace will get errno EOVERFLOW; in inode ref case
INODE_EXTREF is used instead.
However, this is not the case for rename. To avoid the unrecoverable
situation in rename, btrfs_check_dir_item_collision is called in
early phase of rename. In this function, when item key collision is
detected leaf space is checked:
data_size = sizeof(*di) + name_len;
if (data_size + btrfs_item_size_nr(leaf, slot) +
sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info))
the sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) here
refers to existing item size, the condition here correctly calculates
the needed size for collision case rather than the wrong case above.
The consequence of the inconsistent condition checks in
btrfs_check_dir_item_collision and btrfs_search_slot when an item key
collision happens is that we might pass the check here but fail
later in btrfs_search_slot. The rename then fails and the volume is forced read-only:
[436149.586170] ------------[ cut here ]------------
[436149.586173] BTRFS: Transaction aborted (error -75)
[436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs]
[436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G D 4.18.0-rc5+ #1
[436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs]
[436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286
[436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006
[436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0
[436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472
[436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08
[436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe
[436149.586260] FS: 00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000
[436149.586261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0
[436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[436149.586296] Call Trace:
[436149.586311] vfs_rename+0x383/0x920
[436149.586313] ? vfs_rename+0x383/0x920
[436149.586315] do_renameat2+0x4ca/0x590
[436149.586317] __x64_sys_rename+0x20/0x30
[436149.586324] do_syscall_64+0x5a/0x120
[436149.586330] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[436149.586332] RIP: 0033:0x7fa3133b1d37
[436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37
[436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0
[436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0
[436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00
Thanks to Hans van Kranenburg for information about crc32 hash collision
tools, I was able to reproduce the dir item collision with the following
python script:
https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py
Running it on a btrfs volume triggers the transaction abort. It simply
creates files and renames them to forged names that lead to
hash collisions.
There are two ways to fix this. One is to simply revert the patch
878f2d2cb355 ("Btrfs: fix max dir item size calculation") to make the
conditions consistent, although that patch is correct about the size.
The other way is to handle the leaf space check correctly when a
collision happens. I prefer the second one since it corrects the leaf
space check in the collision case. This fix does not account for
sizeof(struct btrfs_item) when the item already exists.
There are two places where ins_len doesn't contain
sizeof(struct btrfs_item), however.
1. extent-tree.c: lookup_inline_extent_backref
2. file-item.c: btrfs_csum_file_blocks
To make the logic of btrfs_search_slot clearer, we add a flag,
search_for_extension, to btrfs_path.
This flag indicates that ins_len passed to btrfs_search_slot doesn't
contain sizeof(struct btrfs_item). When key exists, btrfs_search_slot
will use the actual size needed to calculate the required leaf space.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-01 17:25:12 +08:00
/*
* Indicate that new item ( btrfs_search_slot ) is extending already
* existing item and ins_len contains only the data size and not item
* header ( ie . sizeof ( struct btrfs_item ) is not included ) .
*/
unsigned int search_for_extension : 1 ;
2007-02-02 09:18:22 -05:00
} ;
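The effect of search_for_extension on the leaf space check can be sketched as below. This is a simplified model of the rule described in the commit message, not the kernel's code: when the key already exists and the caller's ins_len includes an item header, the header is deducted, because the new data is merged into the existing item. `EX_ITEM_HEADER_SIZE` and `ex_space_needed` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed size of one packed item header (key + offset + size). */
#define EX_ITEM_HEADER_SIZE 25

/*
 * Return the leaf space an insertion really needs.
 * ins_len includes the item header unless search_for_extension is set.
 */
static size_t ex_space_needed(size_t ins_len, bool key_exists,
			      bool search_for_extension)
{
	if (key_exists && ins_len > 0 && !search_for_extension) {
		/* Existing item grows in place: no new header is added. */
		assert(ins_len >= EX_ITEM_HEADER_SIZE);
		ins_len -= EX_ITEM_HEADER_SIZE;
	}
	return ins_len;
}

/* Usage example covering the three cases. */
static void ex_space_selftest(void)
{
	assert(ex_space_needed(125, false, false) == 125); /* new item */
	assert(ex_space_needed(125, true, false) == 100);  /* key collision */
	assert(ex_space_needed(100, true, true) == 100);   /* data size only */
}
```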
2016-06-15 09:22:56 -04:00
# define BTRFS_MAX_EXTENT_ITEM_SIZE(r) ((BTRFS_LEAF_DATA_SIZE(r->fs_info) >> 4) - \
2009-06-10 10:45:14 -04:00
sizeof ( struct btrfs_item ) )
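As a worked example of the macro above: it takes one sixteenth of the leaf data area and subtracts one item header. The sketch below mirrors that arithmetic with plain parameters; the sample values (a 16283-byte data area and a 25-byte item header) are assumptions chosen for illustration, and `ex_max_extent_item_size` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* (leaf data size / 16) minus one item header, as in the macro above. */
static size_t ex_max_extent_item_size(size_t leaf_data_size, size_t item_header)
{
	assert((leaf_data_size >> 4) >= item_header);
	return (leaf_data_size >> 4) - item_header;
}

/* Usage example: 16283 >> 4 is 1017, and 1017 - 25 is 992. */
static void ex_extent_item_selftest(void)
{
	assert(ex_max_extent_item_size(16283, 25) == 992);
}
```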
2012-11-05 17:26:40 +01:00
struct btrfs_dev_replace {
u64 replace_state ; /* see #define above */
2018-06-12 17:18:25 +05:30
time64_t time_started ; /* seconds since 1-Jan-1970 */
time64_t time_stopped ; /* seconds since 1-Jan-1970 */
2012-11-05 17:26:40 +01:00
atomic64_t num_write_errors ;
atomic64_t num_uncorrectable_read_errors ;
u64 cursor_left ;
u64 committed_cursor_left ;
u64 cursor_left_last_write_of_item ;
u64 cursor_right ;
u64 cont_reading_from_srcdev_mode ; /* see #define above */
int is_valid ;
int item_needs_writeback ;
struct btrfs_device * srcdev ;
struct btrfs_device * tgtdev ;
struct mutex lock_finishing_cancel_unmount ;
2018-04-05 01:29:24 +02:00
struct rw_semaphore rwsem ;
2012-11-05 17:26:40 +01:00
struct btrfs_scrub_progress scrub_progress ;
2018-04-05 01:04:49 +02:00
struct percpu_counter bio_counter ;
wait_queue_head_t replace_wait ;
2012-11-05 17:26:40 +01:00
} ;
2009-04-03 09:47:43 -04:00
/*
* free clusters are used to claim free space in relatively large chunks ,
btrfs: Do not use data_alloc_cluster in ssd mode
This patch provides a band aid to improve the 'out of the box'
behaviour of btrfs for disks that are detected as being an ssd. In a
general purpose mixed workload scenario, the current ssd mode causes
overallocation of available raw disk space for data, while leaving
behind increasing amounts of unused fragmented free space. This
situation leads to early ENOSPC problems which are harming user
experience and adoption of btrfs as a general purpose filesystem.
This patch modifies the data extent allocation behaviour of the ssd mode
to make it behave identical to nossd mode. The metadata behaviour and
additional ssd_spread option stay untouched so far.
Recommendations for future development are to reconsider the current
oversimplified nossd / ssd distinction and the broken detection
mechanism based on the rotational attribute in sysfs and provide
experienced users with a more flexible way to choose allocator behaviour
for data and metadata, optimized for certain use cases, while keeping
sane 'out of the box' default settings. The internals of the current
btrfs code have more potential than what currently gets exposed to the
user to choose from.
The SSD story...
In the first year of btrfs development, around early 2008, btrfs
gained a mount option which enables specific functionality for
filesystems on solid state devices. The first occurrence of this
functionality is in commit e18e4809, labeled "Add mount -o ssd, which
includes optimizations for seek free storage".
The effect on allocating free space for doing (data) writes is to
'cluster' writes together, writing them out in contiguous space, as
opposed to a 'tetris' way of putting all separate writes into any free
space fragment that fits (which is what the -o nossd behaviour does).
A somewhat simplified explanation of what happens is that, when for
example, the 'cluster' size is set to 2MiB, when we do some writes, the
data allocator will search for a free space block that is 2MiB big, and
put the writes in there. The ssd mode itself might allow a 2MiB cluster
to be composed of multiple free space extents with some existing data in
between, while the additional ssd_spread mount option kills off this
option and requires fully free space.
The idea behind this is (commit 536ac8ae): "The [...] clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
block. So, effectively this means applying a "locality based algorithm"
and trying to outsmart the actual ssd.
Since then, various changes have been made to the involved code, but the
basic idea is still present, and gets activated whenever the ssd mount
option is active. This also happens by default, when the rotational flag
as seen at /sys/block/<device>/queue/rotational is set to 0.
However, there's a number of problems with this approach.
First, what the optimization is trying to do is outsmart the ssd by
assuming there is a relation between the physical address space of the
block device as seen by btrfs and the actual physical storage of the
ssd, and then adjusting data placement. However, since the introduction
of the Flash Translation Layer (FTL) which is a part of the internal
controller of an ssd, these attempts are futile. The use of good quality
FTL in consumer ssd products might have been limited in 2008, but this
situation has changed drastically soon after that time. Today, even the
flash memory in your automatic cat feeding machine or your grandma's
wheelchair has a full featured one.
Second, the behaviour as described above results in the filesystem being
filled up with badly fragmented free space extents because of relatively
small pieces of space that are freed up by deletes, but not selected
again as part of a 'cluster'. Since the algorithm prefers allocating a
new chunk over going back to tetris mode, the end result is a filesystem
in which all raw space is allocated, but which is composed of
underutilized chunks with a 'shotgun blast' pattern of fragmented free
space. Usually, the next problematic thing that happens is the
filesystem wanting to allocate new space for metadata, which causes the
filesystem to fail in spectacular ways.
Third, the default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with spreading out writes over
the full address space and ignoring freed up space leads to worst case
behaviour in providing information to the ssd itself, since it will
never learn that all the free space left behind is actually free. There
are two ways to let an ssd know previously written data does not have to
be preserved, which are sending explicit signals using discard or
fstrim, or by simply overwriting the space with new data. The worst
case behaviour is the btrfs ssd_spread mount option in combination with
not having discard enabled. It has a side effect of minimizing the reuse
of free space previously written in.
Fourth, the rotational flag in /sys/ does not reliably indicate if the
device is a locally attached ssd. For example, iSCSI or NBD displays as
non-rotational, while a loop device on an ssd shows up as rotational.
The combination of the second and third problem effectively means that
despite all the good intentions, the btrfs ssd mode reliably causes the
ssd hardware and the filesystem structures and performance to be choked
to death. The clickbait version of the title of this story would have
been "Btrfs ssd optimizations considered harmful for ssds".
The current nossd 'tetris' mode (even still without discard) allows a
pattern of overwriting much more previously used space, causing many
more implicit discards to happen because of the overwrite information
the ssd gets. The actual location in the physical address space, as seen
from the point of view of btrfs is irrelevant, because the actual writes
to the low level flash are reordered anyway thanks to the FTL.
Changes made in the code
1. Make ssd mode data allocation identical to tetris mode, like nossd.
2. Adjust and clean up filesystem mount messages so that we can easily
identify if a kernel has this patch applied or not, when providing
support to end users. Also, make better use of the *_and_info helpers to
only trigger messages on actual state changes.
Backporting notes
Notes for whoever wants to backport this patch to their 4.9 LTS kernel:
* First apply commit 951e7966 "btrfs: drop the nossd flag when
remounting with -o ssd", or fixup the differences manually.
* The rest of the conflicts are because of the fs_info refactoring. So,
for example, instead of using fs_info, it's root->fs_info in
extent-tree.c
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-28 08:31:28 +02:00
* allowing us to do less seeky writes . They are used for all metadata
* allocations . In ssd_spread mode they are also used for data allocations .
2009-04-03 09:47:43 -04:00
*/
struct btrfs_free_cluster {
spinlock_t lock ;
spinlock_t refill_lock ;
struct rb_root root ;
/* largest extent in this cluster */
u64 max_size ;
/* first extent starting offset */
u64 window_start ;
2015-10-02 15:25:10 -04:00
/* We did a full search and couldn't create a cluster */
bool fragmented ;
2019-10-29 19:20:18 +01:00
struct btrfs_block_group * block_group ;
2009-04-03 09:47:43 -04:00
/*
* when a cluster is allocated from a block group , we put the
* cluster onto a list in the block group so that it can
* be freed before the block group is freed .
*/
struct list_head block_group_list ;
2008-03-24 15:01:59 -04:00
} ;
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick off the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block group's caching waitqueue; the
caching kthread wakes the waiting threads every time it finds 2 meg
worth of space, and then again when it has finished caching. This is how I tested
the speedup from this:
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change that's been put in place is that we lock the super mirrors in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until it's
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
enum btrfs_caching_type {
2018-11-27 15:25:13 +01:00
BTRFS_CACHE_NO ,
BTRFS_CACHE_STARTED ,
BTRFS_CACHE_FAST ,
BTRFS_CACHE_FINISHED ,
BTRFS_CACHE_ERROR ,
2009-07-13 21:29:25 -04:00
} ;
2017-04-14 08:35:54 +08:00
/*
* Tree to record all locked full stripes of a RAID5/6 block group
*/
struct btrfs_full_stripe_locks_tree {
struct rb_root root ;
struct mutex lock ;
} ;
2019-12-13 16:22:14 -08:00
/* Discard control. */
/*
* Async discard uses multiple lists to differentiate the discard filter
btrfs: handle empty block_group removal for async discard
block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.
Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.
The new order of events is that a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unused_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.
The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 16:22:15 -08:00
* parameters. Index 0 is for completely free block groups where we need to
* ensure the entire block group is trimmed without being lossy. Indices
* afterwards represent monotonically decreasing discard filter sizes to
* prioritize what should be discarded next.
2019-12-13 16:22:14 -08:00
*/
2020-01-02 16:26:39 -05:00
# define BTRFS_NR_DISCARD_LISTS 3
2019-12-13 16:22:15 -08:00
# define BTRFS_DISCARD_INDEX_UNUSED 0
# define BTRFS_DISCARD_INDEX_START 1
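To make the list layout from the comment above more concrete, here is a minimal userspace sketch of how a per-list size filter could be applied. The two thresholds (1 MiB and 32 KiB) and the helper name should_discard() are assumptions made for illustration only; the header itself only states that there are three lists, that index 0 is for completely free block groups, and that later indices use monotonically decreasing filter sizes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_DISCARD_LISTS     3
#define DISCARD_INDEX_UNUSED 0
#define DISCARD_INDEX_START  1

/*
 * Hypothetical minimum extent length per discard list. Index 0 has no
 * filter because a fully free block group must be trimmed completely.
 * The values for indices 1 and 2 are assumptions, chosen only so that
 * they decrease monotonically.
 */
static const uint64_t discard_minlen[NR_DISCARD_LISTS] = {
	0,
	1024 * 1024,
	32 * 1024,
};

/* Return true if an extent of this length passes the filter of the list. */
static bool should_discard(int index, uint64_t extent_len)
{
	assert(index >= 0 && index < NR_DISCARD_LISTS);
	return extent_len >= discard_minlen[index];
}
```

With these assumed thresholds, a 64 KiB free extent would be skipped while its block group sits on list 1 and picked up once the block group reaches list 2.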
2019-12-13 16:22:14 -08:00
struct btrfs_discard_ctl {
struct workqueue_struct * discard_workers ;
struct delayed_work work ;
spinlock_t lock ;
struct btrfs_block_group * block_group ;
struct list_head discard_list [ BTRFS_NR_DISCARD_LISTS ] ;
2020-01-02 16:26:36 -05:00
u64 prev_discard ;
2020-11-04 09:45:53 +00:00
u64 prev_discard_time ;
2019-12-13 16:22:20 -08:00
atomic_t discardable_extents ;
2019-12-13 16:22:21 -08:00
atomic64_t discardable_bytes ;
2020-01-02 16:26:38 -05:00
u64 max_discard_size ;
2020-11-04 09:45:52 +00:00
u64 delay_ms ;
2020-01-02 16:26:35 -05:00
u32 iops_limit ;
2020-01-02 16:26:36 -05:00
u32 kbps_limit ;
2020-01-02 16:26:41 -05:00
u64 discard_extent_bytes ;
u64 discard_bitmap_bytes ;
atomic64_t discard_bytes_saved ;
2019-12-13 16:22:14 -08:00
} ;
2012-06-21 11:08:04 +02:00
/* delayed seq elem */
struct seq_list {
struct list_head list ;
u64 seq ;
} ;
2015-02-25 15:47:32 +01:00
# define SEQ_LIST_INIT(name) { .list = LIST_HEAD_INIT((name).list), .seq = 0 }
2017-03-16 10:04:34 -06:00
# define SEQ_LAST ((u64)-1)
2013-02-07 16:06:02 -05:00
enum btrfs_orphan_cleanup_state {
ORPHAN_CLEANUP_STARTED = 1 ,
ORPHAN_CLEANUP_DONE = 2 ,
} ;
2020-07-21 10:22:33 -04:00
void btrfs_init_async_reclaim_work ( struct btrfs_fs_info * fs_info ) ;
Btrfs: reclaim the reserved metadata space at background
Before applying this patch, a task had to reclaim the metadata space
by itself if the metadata space was not enough. And when the task started
the space reclamation, all the other tasks which wanted to reserve
metadata space were blocked. In some cases, they would be blocked for
a long time, which made the performance fluctuate wildly.
So we introduce background metadata space reclamation: when the space
is about to be exhausted, we insert a reclaim work into the workqueue, and
the worker of the workqueue reclaims the reserved space in the
background. This way, the tasks needn't reclaim the space by themselves in
most cases, and even if the tasks have to reclaim the space or are blocked
for the space reclamation, they will get enough space more quickly.
Here is my test result(Tested by compilebench):
Memory: 2GB
CPU: 2Cores * 1CPU
Partition: 40GB(SSD)
Test command:
# compilebench -D <mnt> -m
Without this patch:
intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
With this patch:
intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-05-13 17:29:04 -07:00
2012-06-21 11:08:04 +02:00
/* fs_info */
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
struct reloc_control ;
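The reference counting rule from the mixed back reference commit message above (non-shared block: free the old block and let the copy inherit its references; shared block: bump the references of everything the copy points to and drop one reference on the old block) can be sketched in plain C as follows. The struct block type and cow_block() are hypothetical stand-ins for this example, not the btrfs data structures or functions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PTRS 4

/* Hypothetical, simplified tree block used only for this sketch. */
struct block {
	int refs;                      /* reference count of this block */
	int freed;                     /* set once the block is released */
	size_t nr_children;
	struct block *child[MAX_PTRS]; /* extents this block points to */
};

/* Copy 'old' into 'copy' and adjust reference counts as described. */
static void cow_block(struct block *old, struct block *copy)
{
	assert(old->refs >= 1 && old->nr_children <= MAX_PTRS);

	copy->refs = 1;
	copy->freed = 0;
	copy->nr_children = old->nr_children;
	for (size_t i = 0; i < old->nr_children; i++)
		copy->child[i] = old->child[i];

	if (old->refs == 1) {
		/*
		 * Non-shared: free the old block at once. The copy inherits
		 * its references, so the children are left untouched.
		 */
		old->refs = 0;
		old->freed = 1;
	} else {
		/*
		 * Shared: both blocks now point to the children, so each
		 * child gains one reference and the old block loses one.
		 */
		for (size_t i = 0; i < copy->nr_children; i++)
			copy->child[i]->refs++;
		old->refs--;
	}
}
```

The non-shared branch never touches the lower-level blocks, which is the saving the commit message describes.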
2008-03-24 15:01:56 -04:00
struct btrfs_device ;
2008-03-24 15:02:07 -04:00
struct btrfs_fs_devices ;
2012-01-16 22:04:47 +02:00
struct btrfs_balance_control ;
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as the inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so we made this patch, which implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that uses two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the number of delayed items is beyond the lower limit, we create works
for some delayed nodes and insert them into the work queue of the worker, and
then go back.
When the number of delayed items is beyond the upper bound, we create works
for all the delayed nodes that haven't been dealt with, and insert them into
the work queue of the worker, and then wait until the number of untreated
items is below some threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search
for it in the inserting rb-tree first. If we find it, we just drop it. If not,
we add its key into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit a transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
struct btrfs_delayed_root ;
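The insert/delete handling of delayed directory name index items described in the commit message above can be sketched as follows. This is only an illustration under stated assumptions: small fixed-size arrays stand in for the two per-node rb-trees, and the names key_set, delayed_node_sketch, delayed_insert_index() and delayed_delete_index() are hypothetical, not the btrfs API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ITEMS 16

/* Stand-in for one rb-tree of pending directory index keys. */
struct key_set {
	uint64_t keys[MAX_ITEMS];
	size_t count;
};

/* Stand-in for a delayed node: pending insertions and pending deletions. */
struct delayed_node_sketch {
	struct key_set ins;
	struct key_set del;
};

static bool set_add(struct key_set *s, uint64_t key)
{
	if (s->count == MAX_ITEMS)
		return false;
	s->keys[s->count++] = key;
	return true;
}

static bool set_remove(struct key_set *s, uint64_t key)
{
	for (size_t i = 0; i < s->count; i++) {
		if (s->keys[i] == key) {
			/* Order is irrelevant here: move the last key down. */
			s->keys[i] = s->keys[--s->count];
			return true;
		}
	}
	return false;
}

/* Queue a directory index for later insertion into the b+ tree. */
static bool delayed_insert_index(struct delayed_node_sketch *n, uint64_t index)
{
	return set_add(&n->ins, index);
}

/*
 * Delete a directory index: if it is still waiting to be inserted,
 * cancel that insertion; otherwise queue a deletion for the b+ tree.
 */
static bool delayed_delete_index(struct delayed_node_sketch *n, uint64_t index)
{
	if (set_remove(&n->ins, index))
		return true;
	return set_add(&n->del, index);
}
```

A create followed by a delete of the same name cancels out in memory and never reaches the b+ tree, which is one source of the speedup the commit message reports.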
2016-09-02 15:40:02 -04:00
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-03 10:28:12 -07:00
/*
* Block group or device which contains an active swapfile. Used for preventing
* unsafe operations while a swapfile is active.
*
* These are sorted on (ptr, inode) (note that a block group or device can
* contain more than one swapfile). We compare the pointer values because we
* don't actually care what the object is, we just need a quick check whether
* the object exists in the rbtree.
*/
struct btrfs_swapfile_pin {
struct rb_node node ;
void * ptr ;
struct inode * inode ;
/*
2019-10-29 19:20:18 +01:00
* If true, ptr points to a struct btrfs_block_group. Otherwise, ptr
* points to a struct btrfs_device.
2016-11-03 10:28:12 -07:00
*/
bool is_block_group ;
btrfs: fix race between writes to swap files and scrub
When we activate a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents from being changed by operations such as balance and device
replace/resize/remove. There we also call can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO, and if it is we cannot use the extent, since a write
into it would result in COWing the extent.
However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents to be
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group into RO before it processes it
and then back again to RW mode after processing it. That means an attempt
to write into a swap file extent while scrub is processing the respective
block group, will result in COWing the extent, changing its physical
location on disk.
Fix this by making sure that block groups that have extents that are used
by active swap files can not be turned into RO mode, therefore making it
not possible for a scrub to turn them into RO mode. When a scrub finds a
block group that can not be turned to RO due to the existence of extents
used by swap files, it proceeds to the next block group and logs a warning
message that mentions the block group was skipped due to active swap
files - this is the same approach we currently use for balance.
Fixes: ed46ff3d42378 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-05 12:55:37 +00:00
/*
* Only used when 'is_block_group' is true and it is the number of
* extents used by a swapfile for this block group ('ptr' field).
*/
int bg_extent_count ;
2016-11-03 10:28:12 -07:00
} ;
bool btrfs_pinned_by_swapfile ( struct btrfs_fs_info * fs_info , void * ptr ) ;
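The (ptr, inode) ordering described in the comment on struct btrfs_swapfile_pin can be illustrated with a small userspace sketch. It is not the kernel implementation: a sorted array with binary search stands in for the rbtree, and swapfile_pin_sketch, pin_cmp() and pinned_by_swapfile() are hypothetical names. The point it shows is that only the numeric pointer values are compared, never the objects they point to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified pin: which object is pinned, and by which swapfile inode. */
struct swapfile_pin_sketch {
	const void *ptr;   /* block group or device, used as an opaque key */
	const void *inode; /* swapfile inode, used as a tie breaker */
};

/* Order pins by (ptr, inode), comparing addresses as integers. */
static int pin_cmp(const struct swapfile_pin_sketch *a,
		   const struct swapfile_pin_sketch *b)
{
	uintptr_t ap = (uintptr_t)a->ptr, bp = (uintptr_t)b->ptr;
	uintptr_t ai = (uintptr_t)a->inode, bi = (uintptr_t)b->inode;

	if (ap != bp)
		return ap < bp ? -1 : 1;
	if (ai != bi)
		return ai < bi ? -1 : 1;
	return 0;
}

/*
 * Return true if any pin refers to 'ptr'. The array must be sorted by
 * pin_cmp(); the inode is ignored because one object may be pinned by
 * several swapfiles and any match is enough.
 */
static bool pinned_by_swapfile(const struct swapfile_pin_sketch *pins,
			       size_t n, const void *ptr)
{
	uintptr_t target = (uintptr_t)ptr;
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		uintptr_t cur = (uintptr_t)pins[mid].ptr;

		if (cur == target)
			return true;
		if (cur < target)
			lo = mid + 1;
		else
			hi = mid;
	}
	return false;
}
```

Because the lookup key is just an address, the same structure can serve block groups and devices without knowing which one it holds.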
2018-11-27 14:55:46 +01:00
enum {
BTRFS_FS_BARRIER ,
BTRFS_FS_CLOSING_START ,
BTRFS_FS_CLOSING_DONE ,
BTRFS_FS_LOG_RECOVERING ,
BTRFS_FS_OPEN ,
BTRFS_FS_QUOTA_ENABLED ,
BTRFS_FS_UPDATE_UUID_TREE_GEN ,
BTRFS_FS_CREATING_FREE_SPACE_TREE ,
BTRFS_FS_BTREE_ERR ,
BTRFS_FS_LOG1_ERR ,
BTRFS_FS_LOG2_ERR ,
BTRFS_FS_QUOTA_OVERRIDE ,
/* Used to record internally whether fs has been frozen */
BTRFS_FS_FROZEN ,
/*
* Indicate that balance has been set up from the ioctl and is in the
* main phase. The fs_info::balance_ctl is initialized.
Btrfs: prevent send failures and crashes due to concurrent relocation
Send always operates on read-only trees and always expected that while it
is in progress, nothing changes in those trees. Due to that expectation
and the fact that send is a read-only operation, it operates on commit
roots and does not hold transaction handles. However relocation can COW
nodes and leafs from read-only trees, which can cause unexpected failures
and crashes (hitting BUG_ONs). While send is using a node/leaf, it gets
COWed, the transaction used to COW it is committed, a new transaction
starts, the extent previously used for that node/leaf gets allocated,
possibly for another tree, and the respective extent buffer's content
changes while send is still using it. When this happens send normally
fails with EIO being returned to user space and messages like the
following are found in dmesg/syslog:
[ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253
[ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984
Other times, less often, we hit a BUG_ON() because an extent buffer that
send is using used to be a node, and while send is still using it, it
got COWed and got reused as a leaf while send is still using, producing
the following trace:
[ 3478.466280] ------------[ cut here ]------------
[ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806!
[ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1
[ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246
[ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002
[ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000
[ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e
[ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708
[ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 3478.475840] FS: 00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000
[ 3478.476489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0
[ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3478.479003] Call Trace:
[ 3478.479600] ? do_raw_spin_unlock+0x49/0xc0
[ 3478.480202] tree_advance+0x173/0x1d0 [btrfs]
[ 3478.480810] btrfs_compare_trees+0x30c/0x690 [btrfs]
[ 3478.481388] ? process_extent+0x1280/0x1280 [btrfs]
[ 3478.481954] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[ 3478.482510] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[ 3478.483062] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[ 3478.483581] ? rq_clock_task+0x2e/0x60
[ 3478.484086] ? wake_up_new_task+0x1f3/0x370
[ 3478.484582] ? do_vfs_ioctl+0xa2/0x6f0
[ 3478.485075] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 3478.485552] do_vfs_ioctl+0xa2/0x6f0
[ 3478.486016] ? __fget+0x113/0x200
[ 3478.486467] ksys_ioctl+0x70/0x80
[ 3478.486911] __x64_sys_ioctl+0x16/0x20
[ 3478.487337] do_syscall_64+0x60/0x1b0
[ 3478.487751] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7
(...)
[ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7
[ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005
[ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700
[ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005
[ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001
(...)
[ 3478.493352] ---[ end trace d5f537302be4f8c8 ]---
Another possibility, much less likely to happen, is that send will not
fail but the contents of the stream it produces may not be correct.
To avoid this, do not allow send and relocation (balance) to run in
parallel. In the long term the goal is to allow for both to be able to
run concurrently without any problems, but that will take a significant
effort in development and testing.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-22 16:44:09 +01:00
* Set and cleared while holding fs_info::balance_mutex.
2018-11-27 14:55:46 +01:00
*/
BTRFS_FS_BALANCE_RUNNING ,
2019-01-11 10:21:02 -05:00
/* Indicate that the cleaner thread is awake and doing something. */
BTRFS_FS_CLEANER_RUNNING ,
btrfs: detect fast implementation of crc32c on all architectures
Currently, there's only check for fast crc32c implementation on X86,
based on the CPU flags. This is used to decide if checksumming should be
offloaded to worker threads or can be calculated by the caller.
There are more architectures that implement a faster version of
crc32c (ARM, SPARC, s390, MIPS, PowerPC), and there are also specialized hw
cards.
The detection is based on driver name, all generic C implementations
contain 'generic', while the specialized versions do not. Alternatively
the priority could be used, but this is not currently provided by the
crypto API.
The flag is set per-filesystem at mount time and used for the offloading
decisions.
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-16 13:39:59 +02:00
/*
* The checksumming has an optimized version and is considered fast,
* so we don't need to offload checksums to workqueues.
*/
BTRFS_FS_CSUM_IMPL_FAST ,
2019-12-13 16:22:14 -08:00
/* Indicate that the discard workqueue can service discards. */
BTRFS_FS_DISCARD_RUNNING ,
btrfs: keep sb cache_generation consistent with space_cache
When mounting, btrfs uses the cache_generation in the super block to
determine if space cache v1 is in use. However, by mounting with
nospace_cache or space_cache=v2, it is possible to disable space cache
v1, which does not result in un-setting cache_generation back to 0.
In order to base some logic, like mount option printing in /proc/mounts,
on the current state of the space cache rather than just the values of
the mount option, keep the value of cache_generation consistent with the
status of space cache v1.
We ensure that cache_generation > 0 iff the file system is using
space_cache v1. This requires committing a transaction on any mount
which changes whether we are using v1. (v1->nospace_cache, v1->v2,
nospace_cache->v1, v2->v1).
Since the mechanism for writing out the cache generation is transaction
commit, but we want some finer grained control over when we un-set it,
we can't just rely on the SPACE_CACHE mount option, and introduce an
fs_info flag that mount can use when it wants to unset the generation.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-18 15:06:22 -08:00
/* Indicate that we need to cleanup space cache v1 */
BTRFS_FS_CLEANUP_SPACE_CACHE_V1 ,
btrfs: fix possible free space tree corruption with online conversion
While running btrfs/011 in a loop I would often ASSERT() while trying to
add a new free space entry that already existed, or get an EEXIST while
adding a new block to the extent tree, which is another indication of
double allocation.
This occurs because when we do the free space tree population, we create
the new root and then populate the tree and commit the transaction.
The problem is when you create a new root, the root node and commit root
node are the same. During this initial transaction commit we will run
all of the delayed refs that were paused during the free space tree
generation, and thus begin to cache block groups. While caching block
groups the caching thread will be reading from the main root for the
free space tree, so as we make allocations we'll be changing the free
space tree, which can cause us to add the same range twice which results
in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a
variety of different errors when running delayed refs because of a
double allocation.
Fix this by marking the fs_info as unsafe to load the free space tree,
and fall back on the old slow method. We could be smarter than this,
for example caching the block group while we're populating the free
space tree, but since this is a serious problem I've opted for the
simplest solution.
CC: stable@vger.kernel.org # 4.9+
Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-15 16:26:17 -05:00
/* Indicate that we can't trust the free space tree for caching yet */
BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED ,
2018-11-27 14:55:46 +01:00
} ;
2018-03-21 01:31:04 +01:00
2020-08-25 10:02:32 -05:00
/*
* Exclusive operations ( device replace , resize , device add / remove , balance )
*/
enum btrfs_exclusive_operation {
BTRFS_EXCLOP_NONE ,
BTRFS_EXCLOP_BALANCE ,
BTRFS_EXCLOP_DEV_ADD ,
BTRFS_EXCLOP_DEV_REMOVE ,
BTRFS_EXCLOP_DEV_REPLACE ,
BTRFS_EXCLOP_RESIZE ,
BTRFS_EXCLOP_SWAP_ACTIVATE ,
} ;
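The enum above only names the operations; how mutual exclusion is enforced is not shown in this listing. As a rough userspace sketch of the idea (at most one exclusive operation at a time), one could claim a single slot with a compare-and-swap. All demo_ names are hypothetical, and the use of C11 atomics is an assumption made for illustration, not the kernel's mechanism.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical mirror of the enum above, for a standalone example. */
enum demo_exclop {
	DEMO_EXCLOP_NONE,
	DEMO_EXCLOP_BALANCE,
	DEMO_EXCLOP_DEV_ADD,
	DEMO_EXCLOP_DEV_REMOVE,
	DEMO_EXCLOP_DEV_REPLACE,
	DEMO_EXCLOP_RESIZE,
	DEMO_EXCLOP_SWAP_ACTIVATE,
};

/* The single slot recording which exclusive operation is running. */
static _Atomic int demo_current_exclop = DEMO_EXCLOP_NONE;

/* Try to claim the slot; fails if any other operation already holds it. */
static bool demo_exclop_try_start(enum demo_exclop op)
{
	int expected = DEMO_EXCLOP_NONE;

	assert(op != DEMO_EXCLOP_NONE);
	return atomic_compare_exchange_strong(&demo_current_exclop,
					      &expected, (int)op);
}

/* Release the slot so that another operation may start. */
static void demo_exclop_finish(void)
{
	atomic_store(&demo_current_exclop, DEMO_EXCLOP_NONE);
}
```

A caller would pair demo_exclop_try_start() with demo_exclop_finish() and report a busy error when the claim fails.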
2007-03-20 14:38:32 -04:00
struct btrfs_fs_info {
2008-04-15 15:41:47 -04:00
u8 chunk_tree_uuid [ BTRFS_UUID_SIZE ] ;
2016-09-02 15:40:02 -04:00
unsigned long flags ;
2007-03-15 12:56:47 -04:00
struct btrfs_root * extent_root ;
struct btrfs_root * tree_root ;
2008-03-24 15:01:56 -04:00
struct btrfs_root * chunk_root ;
struct btrfs_root * dev_root ;
2008-11-17 21:02:50 -05:00
struct btrfs_root * fs_root ;
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
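The keying scheme described above (a fixed objectid, with the offset set to the starting byte of the extent) can be illustrated with a small key type and comparison function. This is only a sketch: the struct, the helper names, and both constants are placeholders chosen for the example, not the values defined in ctree.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder constants; the real magic values live in ctree.h. */
#define DEMO_CSUM_OBJECTID	((uint64_t)0xC5)
#define DEMO_CSUM_TYPE		((uint8_t)1)

/* Hypothetical three-part key: (objectid, type, offset). */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Order keys by objectid, then type, then offset. */
static int demo_key_cmp(const struct demo_key *a, const struct demo_key *b)
{
	assert(a != NULL && b != NULL);
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* Build the checksum key for an extent starting at the given byte. */
static struct demo_key demo_csum_key(uint64_t extent_start)
{
	struct demo_key key = {
		.objectid = DEMO_CSUM_OBJECTID,
		.type = DEMO_CSUM_TYPE,
		.offset = extent_start,
	};
	return key;
}
```

Because every checksum key shares the same objectid and type, the items sort purely by extent start, which is what lets them be copied into another tree without a second format.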
2008-12-08 16:58:54 -05:00
struct btrfs_root * csum_root ;
2011-09-13 12:56:09 +02:00
struct btrfs_root * quota_root ;
2013-08-15 17:11:19 +02:00
struct btrfs_root * uuid_root ;
2015-09-29 20:50:35 -07:00
struct btrfs_root * free_space_root ;
2020-05-15 14:01:42 +08:00
struct btrfs_root * data_reloc_root ;
2008-09-05 16:13:11 -04:00
/* the log root tree is a directory of all the other log roots */
struct btrfs_root * log_root_tree ;
2009-09-21 15:56:00 -04:00
spinlock_t fs_roots_radix_lock ;
2007-04-09 10:42:37 -04:00
struct radix_tree_root fs_roots_radix ;
2007-10-15 16:15:26 -04:00
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) removed the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree that tracks the block group cache via offset. Also
added a per space_info list that tracks the block group cache for the particular
space so we can look up related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so every time we did a data allocation we'd search
through every block group that we have looking for free space. While searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degradation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
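The two-step search described in point 1 (try at or after the hint first, then fall back to a size-based search) can be sketched as follows. This is a simplified model under stated assumptions: a plain array sorted by offset stands in for the two per-block-group rb-trees, an extent only counts as "at the hint" when it starts at or after it, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-space record; the array is assumed sorted by offset. */
struct demo_free_extent {
	uint64_t offset;
	uint64_t bytes;
};

/* Returns the index of the chosen extent, or -1 if nothing is large enough. */
static int demo_find_free_space(const struct demo_free_extent *ext, size_t n,
				uint64_t hint, uint64_t bytes)
{
	int best = -1;
	size_t i;

	assert(bytes > 0);

	/* Pass 1: offset-ordered search for a fit at or after the hint. */
	for (i = 0; i < n; i++) {
		if (ext[i].offset >= hint && ext[i].bytes >= bytes)
			return (int)i;
	}

	/*
	 * Pass 2: size-based fallback. Prefer the tightest fit so that small
	 * pieces are not broken off huge areas; among equal sizes prefer the
	 * extent closest to the hint (the largest offset, since every
	 * remaining candidate starts before the hint).
	 */
	for (i = 0; i < n; i++) {
		if (ext[i].bytes < bytes)
			continue;
		if (best < 0 || ext[i].bytes < ext[best].bytes ||
		    (ext[i].bytes == ext[best].bytes &&
		     ext[i].offset > ext[best].offset))
			best = (int)i;
	}
	return best;
}
```

A linear scan keeps the example short; with real trees keyed by offset and by size, each pass would be a logarithmic lookup instead.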
2008-09-23 13:14:11 -04:00
/* block group cache stuff */
spinlock_t block_group_cache_lock ;
2012-12-27 09:01:23 +00:00
u64 first_logical_byte ;
2008-09-23 13:14:11 -04:00
struct rb_root block_group_cache_tree ;
2011-09-26 17:12:22 -04:00
/* keep track of unallocated space */
2017-05-11 09:17:46 +03:00
atomic64_t free_chunk_space ;
2011-09-26 17:12:22 -04:00
2020-01-20 16:09:18 +02:00
/* Track ranges which are used by log trees blocks/logged data extents */
struct extent_io_tree excluded_extents ;
2007-10-15 16:15:26 -04:00
2008-03-24 15:01:56 -04:00
/* logical->physical extent mapping */
2019-05-17 11:43:17 +02:00
struct extent_map_tree mapping_tree ;
2008-03-24 15:01:56 -04:00
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as the inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so this patch implements delayed directory name index
insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that uses two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. The
other is used to manage the delayed nodes which are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees: one is used to manage the directory name
index items which are going to be inserted into the b+ tree, and the other is
used to manage the directory name index items which are going to be deleted
from the b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the number of delayed items is beyond the lower limit, we create works
for some delayed nodes and insert them into the work queue of the worker, and
then go back.
When the number of delayed items is beyond the upper bound, we create works
for all the delayed nodes that haven't been dealt with, insert them into the
work queue of the worker, and then wait until the number of untreated items is
below some threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search for
it in the inserting rb-tree first. If we find it, we just drop it. If not, we
add its key into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit a transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
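The balance policy described in the implementation notes above (queue some work past a lower limit, queue everything and wait past an upper bound) amounts to a three-way decision. The sketch below shows only that decision; the enum, the function name, and the threshold parameters are hypothetical, and the message does not state the actual limits.

```c
#include <assert.h>

/* Hypothetical outcomes of the delayed-items balance check. */
enum demo_balance_action {
	DEMO_BALANCE_NONE,		/* below the lower limit: do nothing */
	DEMO_BALANCE_ASYNC_SOME,	/* queue work for some nodes, return */
	DEMO_BALANCE_FLUSH_AND_WAIT,	/* queue work for all nodes and wait */
};

/* Decide what to do for the current number of delayed items. */
static enum demo_balance_action
demo_delayed_balance(unsigned long items, unsigned long lower,
		     unsigned long upper)
{
	assert(lower <= upper);
	if (items > upper)
		return DEMO_BALANCE_FLUSH_AND_WAIT;
	if (items > lower)
		return DEMO_BALANCE_ASYNC_SOME;
	return DEMO_BALANCE_NONE;
}
```

The check is made after each delayed insertion or deletion, so the common case stays cheap and only a caller that crosses the upper bound is made to wait.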
2011-04-22 18:12:22 +08:00
/*
* block reservation for extent , checksum , root tree and
* delayed dir index item
*/
2010-05-16 10:46:25 -04:00
struct btrfs_block_rsv global_block_rsv ;
/* block reservation for metadata operations */
struct btrfs_block_rsv trans_block_rsv ;
/* block reservation for chunk tree */
struct btrfs_block_rsv chunk_block_rsv ;
2011-11-03 22:54:25 -04:00
/* block reservation for delayed operations */
struct btrfs_block_rsv delayed_block_rsv ;
btrfs: introduce delayed_refs_rsv
Traditionally we've had voodoo in btrfs to account for the space that
delayed refs may take up by having a global_block_rsv. This works most
of the time, except when it doesn't. We've had issues reported and seen
in production where sometimes the global reserve is exhausted during
transaction commit before we can run all of our delayed refs, resulting
in an aborted transaction. Because of this voodoo we have equally
dubious flushing semantics around throttling delayed refs which we often
get wrong.
So instead give them their own block_rsv. This way we can always know
exactly how much outstanding space we need for delayed refs. This
allows us to make sure we are constantly filling that reservation up
with space, and allows us to put more precise pressure on the enospc
system. Instead of doing math to see if it's a good time to throttle,
the normal enospc code will be invoked if we have a lot of delayed refs
pending, and they will be run via the normal flushing mechanism.
For now the delayed_refs_rsv will hold the reservations for the delayed
refs, the block group updates, and deleting csums. We could have a
separate rsv for the block group updates, but the csum deletion stuff is
still handled via the delayed_refs so that will stay there.
Historical background:
The global reserve has grown to cover everything we don't reserve space
explicitly for, and we've grown a lot of weird ad-hoc heuristics to know
if we're running short on space and when it's time to force a commit. A
failure rate of 20-40 file systems when we run hundreds of thousands of
them isn't super high, but cleaning up this code will make things less
ugly and more predictable.
Thus the delayed refs rsv. We always know how many delayed refs we have
outstanding, and although running them generates more we can use the
global reserve for that spill over, which fits better into its desired
use than a full blown reservation. This first approach is to simply
take however many items we're reserving space for and multiply that by 2 in
order to save enough space for the delayed refs that could be generated.
This is a naive approach and will probably evolve, but for now it works.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com> # high-level review
[ added background notes from the cover letter ]
Signed-off-by: David Sterba <dsterba@suse.com>
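The "multiply by 2" sizing rule from the background notes can be written out as plain arithmetic. In this sketch the function name and the per-item byte cost are hypothetical (the message gives no concrete cost), and overflow is ignored for brevity.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes to reserve for delayed refs: the cost of the items being reserved
 * for, doubled to cover the refs that running them may generate.
 */
static uint64_t demo_delayed_refs_rsv_bytes(uint64_t nr_items,
					    uint64_t bytes_per_item)
{
	assert(bytes_per_item > 0);
	return nr_items * bytes_per_item * 2;
}
```

For example, with an assumed cost of 16384 bytes per item, three items would reserve 3 * 16384 * 2 = 98304 bytes.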
2018-12-03 10:20:33 -05:00
/* block reservation for delayed refs */
struct btrfs_block_rsv delayed_refs_rsv ;
2010-05-16 10:46:25 -04:00
struct btrfs_block_rsv empty_block_rsv ;
2007-03-20 15:57:25 -04:00
u64 generation ;
2007-08-10 16:22:09 -04:00
u64 last_trans_committed ;
2014-01-23 10:54:11 -05:00
u64 avg_delayed_ref_runtime ;
2009-03-24 10:24:20 -04:00
/*
* this is updated to the current trans every time a full commit
* is required instead of the faster short fsync log commits
*/
u64 last_trans_log_full_commit ;
2012-03-30 13:58:32 +02:00
unsigned long mount_opt ;
2014-02-05 15:26:17 +01:00
/*
* Track requests for actions that need to be done during transaction
* commit ( like for some mount options ) .
*/
unsigned long pending_changes ;
2010-12-17 14:21:50 +08:00
unsigned long compress_type : 4 ;
2017-09-15 17:36:57 +02:00
unsigned int compress_level ;
2018-02-13 17:50:46 +08:00
u32 commit_interval ;
2013-01-29 10:05:05 +00:00
/*
* It is a suggestive number , the read side is safe even if it gets a
* wrong number because we will write out the data into a regular
* extent . The write side ( mount / remount ) is under - > s_umount lock ,
* so it is also safe .
*/
2008-01-29 16:03:38 -05:00
u64 max_inline ;
2017-06-15 01:30:06 +02:00
2007-03-22 15:59:16 -04:00
struct btrfs_transaction * running_transaction ;
2008-07-17 12:53:50 -04:00
wait_queue_head_t transaction_throttle ;
2008-07-17 12:54:14 -04:00
wait_queue_head_t transaction_wait ;
2010-10-29 15:37:34 -04:00
wait_queue_head_t transaction_blocked_wait ;
2008-11-06 22:02:51 -05:00
wait_queue_head_t async_submit_wait ;
2008-09-05 16:13:11 -04:00
2013-04-11 10:30:16 +00:00
/*
* Used to protect the incompat_flags , compat_flags , compat_ro_flags
* when they are updated .
*
* Because we never clear the flags , we needn ' t use
* the lock on the read side .
*
* We also needn ' t use the lock when we mount the fs , because
* there is no other task which will update the flag .
*/
spinlock_t super_lock ;
2011-04-13 15:41:04 +02:00
struct btrfs_super_block * super_copy ;
struct btrfs_super_block * super_for_commit ;
2007-03-22 12:13:20 -04:00
struct super_block * sb ;
2007-03-28 13:57:48 -04:00
struct inode * btree_inode ;
2008-09-05 16:13:11 -04:00
struct mutex tree_log_mutex ;
2008-06-25 16:01:31 -04:00
struct mutex transaction_kthread_mutex ;
struct mutex cleaner_mutex ;
2008-06-25 16:01:30 -04:00
struct mutex chunk_mutex ;
2013-01-29 18:40:14 -05:00
2015-04-06 12:46:08 -07:00
/*
* this is taken to make sure we don ' t set block groups ro after
* the free space cache has been allocated on them
*/
struct mutex ro_block_group_mutex ;
2013-01-29 18:40:14 -05:00
/* this is used during read/modify/write to make sure
* no two ios are trying to mod the same stripe at the same
* time
*/
struct btrfs_stripe_hash_table * stripe_hash_table ;
2009-03-31 13:27:11 -04:00
/*
* this protects the ordered operations list only while we are
* processing all of the entries on it . This way we make
* sure the commit code doesn ' t find the list temporarily empty
* because another function happens to be doing non - waiting preflush
* before jumping into the main commit .
*/
struct mutex ordered_operations_mutex ;
2013-08-14 11:33:56 -04:00
2014-03-13 15:42:13 -04:00
struct rw_semaphore commit_root_sem ;
2009-03-31 13:27:11 -04:00
2009-11-12 09:34:40 +00:00
struct rw_semaphore cleanup_work_sem ;
2009-09-21 16:00:26 -04:00
2009-11-12 09:34:40 +00:00
struct rw_semaphore subvol_sem ;
2009-09-21 16:00:26 -04:00
2011-04-11 17:25:13 -04:00
spinlock_t trans_lock ;
2011-06-13 20:00:16 -04:00
/*
* the reloc mutex goes with the trans lock , it is taken
* during commit to protect us from the relocation code
*/
struct mutex reloc_mutex ;
2007-04-19 21:01:03 -04:00
struct list_head trans_list ;
2007-06-08 18:11:48 -04:00
struct list_head dead_roots ;
2009-09-11 16:11:19 -04:00
struct list_head caching_block_groups ;
2008-09-05 16:13:11 -04:00
2009-11-12 09:36:34 +00:00
spinlock_t delayed_iput_lock ;
struct list_head delayed_iputs ;
2018-12-03 11:06:52 -05:00
atomic_t nr_delayed_iputs ;
wait_queue_head_t delayed_iputs_wait ;
2009-11-12 09:36:34 +00:00
2013-04-24 16:57:33 +00:00
atomic64_t tree_mod_seq ;
2012-05-16 17:55:38 +02:00
Btrfs: fix race between adding and putting tree mod seq elements and nodes
There is a race between adding and removing elements to the tree mod log
list and rbtree that can lead to use-after-free problems.
Consider the following example that explains how/why the problem happens:
1) Task A has mod log element with sequence number 200. It currently is
the only element in the mod log list;
2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to
access the tree mod log. When it enters the function, it initializes
'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock'
before checking if there are other elements in the mod seq list.
Since the list is empty, 'min_seq' remains set to (u64)-1. Then it
unlocks the lock 'tree_mod_seq_lock';
3) Before task A acquires the lock 'tree_mod_log_lock', task B adds
itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a
sequence number of 201;
4) Some other task, name it task C, modifies a btree and because there
are elements in the mod seq list, it adds a tree mod elem to the tree
mod log rbtree. That node added to the mod log rbtree is assigned
a sequence number of 202;
5) Task B, which is doing fiemap and resolving indirect back references,
calls btrfs get_old_root(), with 'time_seq' == 201, which in turn
calls tree_mod_log_search() - the search returns the mod log node
from the rbtree with sequence number 202, created by task C;
6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating
the mod log rbtree and finds the node with sequence number 202. Since
202 is less than the previously computed 'min_seq', (u64)-1, it
removes the node and frees it;
7) Task B still has a pointer to the node with sequence number 202, and
it dereferences the pointer itself and through the call to
__tree_mod_log_rewind(), resulting in a use-after-free problem.
This issue can be triggered sporadically with the test case generic/561
from fstests, and it happens more frequently with a higher number of
duperemove processes. When it happens to me, it either freezes the VM or
it produces a trace like the following before crashing:
[ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1
[ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[ 1245.321287] RIP: 0010:rb_next+0x16/0x50
[ 1245.321307] Code: ....
[ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202
[ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b
[ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80
[ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000
[ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038
[ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8
[ 1245.321539] FS: 00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000
[ 1245.321591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0
[ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1245.321706] Call Trace:
[ 1245.321798] __tree_mod_log_rewind+0xbf/0x280 [btrfs]
[ 1245.321841] btrfs_search_old_slot+0x105/0xd00 [btrfs]
[ 1245.321877] resolve_indirect_refs+0x1eb/0xc60 [btrfs]
[ 1245.321912] find_parent_nodes+0x3dc/0x11b0 [btrfs]
[ 1245.321947] btrfs_check_shared+0x115/0x1c0 [btrfs]
[ 1245.321980] ? extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322029] extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322066] do_vfs_ioctl+0x45a/0x750
[ 1245.322081] ksys_ioctl+0x70/0x80
[ 1245.322092] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1245.322113] __x64_sys_ioctl+0x16/0x20
[ 1245.322126] do_syscall_64+0x5c/0x280
[ 1245.322139] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1245.322155] RIP: 0033:0x7fdee3942dd7
[ 1245.322177] Code: ....
[ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7
[ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004
[ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44
[ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48
[ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50
[ 1245.322423] Modules linked in: ....
[ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]---
Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum
sequence number and iterates the rbtree while holding the lock
'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock'
lock, since it is now redundant.
Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions")
Fixes: 097b8a7c9e48e2 ("Btrfs: join tree mod log code with the code holding back delayed refs")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
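The fix described above (compute the minimum sequence number and prune the log inside one critical section) can be modeled in userspace. This is a single-threaded sketch under stated assumptions: fixed-size arrays stand in for the list and the rbtree, the lock is only marked by comments, and every demo_ name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX 16

/* Hypothetical model: active sequence holders plus logged modifications. */
struct demo_mod_log {
	uint64_t seq_list[DEMO_MAX];	/* sequence numbers still in use */
	size_t nr_seq;
	uint64_t log[DEMO_MAX];		/* sequence numbers of log entries */
	size_t nr_log;
};

/* Drop one holder, then prune log entries no remaining holder can need. */
static void demo_put_seq(struct demo_mod_log *ml, uint64_t seq)
{
	uint64_t min_seq = UINT64_MAX;
	size_t i, kept = 0;

	assert(ml != NULL);

	/* The write lock would be taken here and held until the end. */
	for (i = 0; i < ml->nr_seq; i++) {
		uint64_t cur = ml->seq_list[i];

		if (cur == seq)
			continue;	/* remove the caller's element */
		if (cur < min_seq)
			min_seq = cur;
		ml->seq_list[kept++] = cur;
	}
	ml->nr_seq = kept;

	/* Same critical section: min_seq cannot go stale before pruning. */
	kept = 0;
	for (i = 0; i < ml->nr_log; i++) {
		if (ml->log[i] >= min_seq)
			ml->log[kept++] = ml->log[i];
	}
	ml->nr_log = kept;
	/* The write lock would be released here. */
}
```

Replaying the scenario from the message with holders 200 and 201 and a log entry 202: putting 200 yields a minimum of 201, so entry 202 survives for task B; only once 201 is also put does the entry get freed.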
2020-01-22 12:23:20 +00:00
/* this protects tree_mod_log and tree_mod_seq_list */
2012-05-16 17:55:38 +02:00
rwlock_t tree_mod_log_lock ;
struct rb_root tree_mod_log ;
Btrfs: fix race between adding and putting tree mod seq elements and nodes
There is a race between adding and removing elements to the tree mod log
list and rbtree that can lead to use-after-free problems.
Consider the following example that explains how/why the problems happens:
1) Task A has mod log element with sequence number 200. It currently is
the only element in the mod log list;
2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to
access the tree mod log. When it enters the function, it initializes
'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock'
before checking if there are other elements in the mod seq list.
Since the list it empty, 'min_seq' remains set to (u64)-1. Then it
unlocks the lock 'tree_mod_seq_lock';
3) Before task A acquires the lock 'tree_mod_log_lock', task B adds
itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a
sequence number of 201;
4) Some other task, name it task C, modifies a btree and because there
elements in the mod seq list, it adds a tree mod elem to the tree
mod log rbtree. That node added to the mod log rbtree is assigned
a sequence number of 202;
5) Task B, which is doing fiemap and resolving indirect back references,
calls btrfs get_old_root(), with 'time_seq' == 201, which in turn
calls tree_mod_log_search() - the search returns the mod log node
from the rbtree with sequence number 202, created by task C;
6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating
the mod log rbtree and finds the node with sequence number 202. Since
202 is less than the previously computed 'min_seq', (u64)-1, it
removes the node and frees it;
7) Task B still has a pointer to the node with sequence number 202, and
it dereferences the pointer itself and through the call to
__tree_mod_log_rewind(), resulting in a use-after-free problem.
This issue can be triggered sporadically with the test case generic/561
from fstests, and it happens more frequently with a higher number of
duperemove processes. When it happens to me, it either freezes the VM or
it produces a trace like the following before crashing:
[ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1
[ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[ 1245.321287] RIP: 0010:rb_next+0x16/0x50
[ 1245.321307] Code: ....
[ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202
[ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b
[ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80
[ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000
[ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038
[ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8
[ 1245.321539] FS: 00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000
[ 1245.321591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0
[ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1245.321706] Call Trace:
[ 1245.321798] __tree_mod_log_rewind+0xbf/0x280 [btrfs]
[ 1245.321841] btrfs_search_old_slot+0x105/0xd00 [btrfs]
[ 1245.321877] resolve_indirect_refs+0x1eb/0xc60 [btrfs]
[ 1245.321912] find_parent_nodes+0x3dc/0x11b0 [btrfs]
[ 1245.321947] btrfs_check_shared+0x115/0x1c0 [btrfs]
[ 1245.321980] ? extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322029] extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322066] do_vfs_ioctl+0x45a/0x750
[ 1245.322081] ksys_ioctl+0x70/0x80
[ 1245.322092] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1245.322113] __x64_sys_ioctl+0x16/0x20
[ 1245.322126] do_syscall_64+0x5c/0x280
[ 1245.322139] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1245.322155] RIP: 0033:0x7fdee3942dd7
[ 1245.322177] Code: ....
[ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7
[ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004
[ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44
[ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48
[ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50
[ 1245.322423] Modules linked in: ....
[ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]---
Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum
sequence number and iterates the rbtree while holding the lock
'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock'
lock, since it is now redundant.
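
The idea behind the fix can be illustrated with a small userspace sketch. Everything in it is a hypothetical stand-in rather than the kernel code: struct mod_log, struct seq_elem and put_seq() are made-up names, a singly linked list replaces the blocker list, and a pthread mutex replaces 'tree_mod_log_lock' held in write mode. The point it shows is that removing an element and computing the new minimum sequence number happen in one critical section.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins, not the kernel's types. */
struct seq_elem {
	uint64_t seq;              /* 0 means "not registered" */
	struct seq_elem *next;
};

struct mod_log {
	pthread_mutex_t lock;      /* stands in for tree_mod_log_lock (write mode) */
	struct seq_elem *blockers; /* stands in for tree_mod_seq_list */
};

/*
 * Unregister elem and return the smallest sequence number still registered,
 * or UINT64_MAX if none is left. Removal and the minimum computation share
 * one critical section, so no other task can change the list in between.
 */
static uint64_t put_seq(struct mod_log *log, struct seq_elem *elem)
{
	uint64_t min_seq = UINT64_MAX;
	struct seq_elem **pp;
	struct seq_elem *cur;

	assert(log != NULL && elem != NULL);

	pthread_mutex_lock(&log->lock);

	/* Unlink elem if it is on the list. */
	for (pp = &log->blockers; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == elem) {
			*pp = elem->next;
			break;
		}
	}
	elem->next = NULL;
	elem->seq = 0;

	/* Compute the minimum over the remaining elements, lock still held. */
	for (cur = log->blockers; cur != NULL; cur = cur->next) {
		if (cur->seq < min_seq)
			min_seq = cur->seq;
	}

	/* Pruning of log entries older than min_seq would go here. */
	pthread_mutex_unlock(&log->lock);

	return min_seq;
}
```

With two registered elements holding sequence numbers 5 and 9, dropping the first one returns 9, and dropping the second one returns UINT64_MAX.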
Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions")
Fixes: 097b8a7c9e48e2 ("Btrfs: join tree mod log code with the code holding back delayed refs")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-22 12:23:20 +00:00
struct list_head tree_mod_seq_list ;
2012-05-16 17:55:38 +02:00
2008-11-06 22:02:51 -05:00
atomic_t async_delalloc_pages ;
2008-04-09 16:28:12 -04:00
2008-07-24 11:57:52 -04:00
/*
2013-05-15 07:48:23 +00:00
* this is used to protect the following list -- ordered_roots.
2008-07-24 11:57:52 -04:00
*/
2013-05-15 07:48:23 +00:00
spinlock_t ordered_root_lock ;
2009-03-31 13:27:11 -04:00
/*
2013-05-15 07:48:23 +00:00
* all fs/file tree roots in which there are data=ordered extents
* pending writeback are added into this list.
*
2009-03-31 13:27:11 -04:00
* these can span multiple transactions and basically include
* every dirty data page that isn't from nodatacow
*/
2013-05-15 07:48:23 +00:00
struct list_head ordered_roots ;
2009-03-31 13:27:11 -04:00
2014-03-06 13:55:03 +08:00
struct mutex delalloc_root_mutex ;
2013-05-15 07:48:22 +00:00
spinlock_t delalloc_root_lock ;
/* all fs/file tree roots that have delalloc inodes. */
struct list_head delalloc_roots ;
2008-07-24 11:57:52 -04:00
2008-06-11 16:50:36 -04:00
/*
* there is a pool of worker threads for checksumming during writes
* and a pool for checksumming after reads. This is because readers
* can run with FS locks held, and the writers may be waiting for
* those locks. We don't want ordering in the pending list to cause
* deadlocks, and so the two are serviced separately.
2008-06-12 14:46:17 -04:00
*
* A third pool does submit_bio to avoid deadlocking with the other
* two
2008-06-11 16:50:36 -04:00
*/
2014-02-28 10:46:19 +08:00
struct btrfs_workqueue * workers ;
struct btrfs_workqueue * delalloc_workers ;
struct btrfs_workqueue * flush_workers ;
struct btrfs_workqueue * endio_workers ;
struct btrfs_workqueue * endio_meta_workers ;
struct btrfs_workqueue * endio_raid56_workers ;
struct btrfs_workqueue * rmw_workers ;
struct btrfs_workqueue * endio_meta_write_workers ;
struct btrfs_workqueue * endio_write_workers ;
struct btrfs_workqueue * endio_freespace_worker ;
struct btrfs_workqueue * caching_workers ;
struct btrfs_workqueue * readahead_workers ;
2011-06-30 14:42:28 -04:00
2008-07-17 12:53:51 -04:00
/*
* fixup workers take dirty pages that didn't properly go through
* the cow mechanism and make them safe to write. It happens
* for the sys_munmap function call path
*/
2014-02-28 10:46:19 +08:00
struct btrfs_workqueue * fixup_workers ;
struct btrfs_workqueue * delayed_workers ;
2014-05-22 16:18:52 -07:00
2008-06-25 16:01:31 -04:00
struct task_struct * transaction_kthread ;
struct task_struct * cleaner_kthread ;
2018-02-13 17:50:42 +08:00
u32 thread_pool_size ;
2008-06-11 16:50:36 -04:00
2013-11-01 13:07:04 -04:00
struct kobject * space_info_kobj ;
2020-06-28 13:07:15 +08:00
struct kobject * qgroups_kobj ;
2007-03-20 14:38:32 -04:00
2007-11-16 14:57:08 -05:00
u64 total_pinned ;
2009-03-13 11:00:37 -04:00
2013-01-29 10:09:20 +00:00
/* used to keep from writing metadata until there is a nice batch */
struct percpu_counter dirty_metadata_bytes ;
2013-01-29 10:10:51 +00:00
struct percpu_counter delalloc_bytes ;
2020-10-09 09:28:20 -04:00
struct percpu_counter ordered_bytes ;
2013-01-29 10:09:20 +00:00
s32 dirty_metadata_batch ;
2013-01-29 10:10:51 +00:00
s32 delalloc_batch ;
2008-03-24 15:01:56 -04:00
struct list_head dirty_cowonly_roots ;
2008-03-24 15:02:07 -04:00
struct btrfs_fs_devices * fs_devices ;
2009-03-10 12:39:20 -04:00
/*
2018-03-20 15:25:25 -04:00
* The space_info list is effectively read only after initial
* setup. It is populated at mount time and cleaned up after
* all block groups are removed. RCU is used to protect it.
2009-03-10 12:39:20 -04:00
*/
2008-03-24 15:01:59 -04:00
struct list_head space_info ;
2009-03-10 12:39:20 -04:00
2012-07-09 20:21:07 -06:00
struct btrfs_space_info * data_sinfo ;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
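
The two cow rules above can be sketched as a toy model. This is only an illustration under simplifying assumptions: struct block, MAX_PTRS and cow_block() are hypothetical names, blocks live in memory instead of on disk, and back reference items are not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PTRS 4

/* Hypothetical in-memory model of a tree block. */
struct block {
	int refs;                     /* reference count of this block */
	bool freed;                   /* set once the block has been freed */
	int nr_ptrs;                  /* number of valid entries in ptrs[] */
	struct block *ptrs[MAX_PTRS]; /* extents this block points to */
};

/* Copy-on-write 'old' into caller-provided storage 'copy'. */
static void cow_block(struct block *old, struct block *copy)
{
	int i;

	assert(old->refs >= 1 && old->nr_ptrs <= MAX_PTRS);

	copy->refs = 1;
	copy->freed = false;
	copy->nr_ptrs = old->nr_ptrs;
	for (i = 0; i < old->nr_ptrs; i++)
		copy->ptrs[i] = old->ptrs[i];

	if (old->refs == 1) {
		/*
		 * Non-shared block: free the old block at once. The copy
		 * inherits its references, so lower levels are untouched.
		 */
		old->refs = 0;
		old->freed = true;
	} else {
		/*
		 * Shared block: everything the copy points to gains one
		 * reference, and the old block loses one.
		 */
		for (i = 0; i < copy->nr_ptrs; i++)
			copy->ptrs[i]->refs++;
		old->refs--;
	}
}
```

In the non-shared case no reference count below the cow'd block changes, which is what removes the need for the expensive walk of the old tree.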
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
struct reloc_control * reloc_ctl ;
btrfs: Do not use data_alloc_cluster in ssd mode
This patch provides a band aid to improve the 'out of the box'
behaviour of btrfs for disks that are detected as being an ssd. In a
general purpose mixed workload scenario, the current ssd mode causes
overallocation of available raw disk space for data, while leaving
behind increasing amounts of unused fragmented free space. This
situation leads to early ENOSPC problems which are harming user
experience and adoption of btrfs as a general purpose filesystem.
This patch modifies the data extent allocation behaviour of the ssd mode
to make it behave identical to nossd mode. The metadata behaviour and
additional ssd_spread option stay untouched so far.
Recommendations for future development are to reconsider the current
oversimplified nossd / ssd distinction and the broken detection
mechanism based on the rotational attribute in sysfs and provide
experienced users with a more flexible way to choose allocator behaviour
for data and metadata, optimized for certain use cases, while keeping
sane 'out of the box' default settings. The internals of the current
btrfs code have more potential than what currently gets exposed to the
user to choose from.
The SSD story...
In the first year of btrfs development, around early 2008, btrfs
gained a mount option which enables specific functionality for
filesystems on solid state devices. The first occurrence of this
functionality is in commit e18e4809, labeled "Add mount -o ssd, which
includes optimizations for seek free storage".
The effect on allocating free space for doing (data) writes is to
'cluster' writes together, writing them out in contiguous space, as
opposed to a 'tetris' way of putting all separate writes into any free
space fragment that fits (which is what the -o nossd behaviour does).
A somewhat simplified explanation of what happens is that, when for
example, the 'cluster' size is set to 2MiB, when we do some writes, the
data allocator will search for a free space block that is 2MiB big, and
put the writes in there. The ssd mode itself might allow a 2MiB cluster
to be composed of multiple free space extents with some existing data in
between, while the additional ssd_spread mount option kills off this
option and requires fully free space.
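
The difference between the two placement strategies can be sketched with a toy allocator. This is a deliberately simplified model, not the btrfs allocator: struct free_extent, find_extent(), alloc_tetris() and alloc_clustered() are hypothetical names, the cluster must be a single fully free extent (closest to the ssd_spread behaviour described above), and nothing is actually consumed from the free space list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one free space extent (units are arbitrary). */
struct free_extent {
	unsigned long start;
	unsigned long len;
};

/* Return the start of the first extent of at least min_len, or -1. */
static long find_extent(const struct free_extent *fe, size_t n,
			unsigned long min_len)
{
	size_t i;

	assert(fe != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (fe[i].len >= min_len)
			return (long)fe[i].start;
	}
	return -1;
}

/* 'tetris' (nossd) placement: any free extent that fits the write will do. */
static long alloc_tetris(const struct free_extent *fe, size_t n,
			 unsigned long write_len)
{
	return find_extent(fe, n, write_len);
}

/* Clustered placement: insist on a free extent as large as the cluster. */
static long alloc_clustered(const struct free_extent *fe, size_t n,
			    unsigned long write_len,
			    unsigned long cluster_len)
{
	unsigned long need = write_len > cluster_len ? write_len : cluster_len;

	return find_extent(fe, n, need);
}
```

With free extents of 64, 128 and 512 KiB and a 2048 KiB cluster, a 16 KiB write is placed at once by alloc_tetris(), while alloc_clustered() returns -1. In the real filesystem that failure is the point where a new chunk gets allocated although free space exists.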
The idea behind this is (commit 536ac8ae): "The [...] clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
block. So, effectively this means applying a "locality based algorithm"
and trying to outsmart the actual ssd.
Since then, various changes have been made to the involved code, but the
basic idea is still present, and gets activated whenever the ssd mount
option is active. This also happens by default, when the rotational flag
as seen at /sys/block/<device>/queue/rotational is set to 0.
However, there's a number of problems with this approach.
First, what the optimization is trying to do is outsmart the ssd by
assuming there is a relation between the physical address space of the
block device as seen by btrfs and the actual physical storage of the
ssd, and then adjusting data placement. However, since the introduction
of the Flash Translation Layer (FTL) which is a part of the internal
controller of an ssd, these attempts are futile. The use of good quality
FTL in consumer ssd products might have been limited in 2008, but this
situation has changed drastically soon after that time. Today, even the
flash memory in your automatic cat feeding machine or your grandma's
wheelchair has a full featured one.
Second, the behaviour as described above results in the filesystem being
filled up with badly fragmented free space extents because of relatively
small pieces of space that are freed up by deletes, but not selected
again as part of a 'cluster'. Since the algorithm prefers allocating a
new chunk over going back to tetris mode, the end result is a filesystem
in which all raw space is allocated, but which is composed of
underutilized chunks with a 'shotgun blast' pattern of fragmented free
space. Usually, the next problematic thing that happens is the
filesystem wanting to allocate new space for metadata, which causes the
filesystem to fail in spectacular ways.
Third, the default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with spreading out writes over
the full address space and ignoring freed up space leads to worst case
behaviour in providing information to the ssd itself, since it will
never learn that all the free space left behind is actually free. There
are two ways to let an ssd know previously written data does not have to
be preserved, which are sending explicit signals using discard or
fstrim, or by simply overwriting the space with new data. The worst
case behaviour is the btrfs ssd_spread mount option in combination with
not having discard enabled. It has a side effect of minimizing the reuse
of free space previously written in.
Fourth, the rotational flag in /sys/ does not reliably indicate if the
device is a locally attached ssd. For example, iSCSI or NBD displays as
non-rotational, while a loop device on an ssd shows up as rotational.
The combination of the second and third problem effectively means that
despite all the good intentions, the btrfs ssd mode reliably causes the
ssd hardware and the filesystem structures and performance to be choked
to death. The clickbait version of the title of this story would have
been "Btrfs ssd optimizations considered harmful for ssds".
The current nossd 'tetris' mode (even still without discard) allows a
pattern of overwriting much more previously used space, causing many
more implicit discards to happen because of the overwrite information
the ssd gets. The actual location in the physical address space, as seen
from the point of view of btrfs is irrelevant, because the actual writes
to the low level flash are reordered anyway thanks to the FTL.
Changes made in the code
1. Make ssd mode data allocation identical to tetris mode, like nossd.
2. Adjust and clean up filesystem mount messages so that we can easily
identify if a kernel has this patch applied or not, when providing
support to end users. Also, make better use of the *_and_info helpers to
only trigger messages on actual state changes.
Backporting notes
Notes for whoever wants to backport this patch to their 4.9 LTS kernel:
* First apply commit 951e7966 "btrfs: drop the nossd flag when
remounting with -o ssd", or fixup the differences manually.
* The rest of the conflicts are because of the fs_info refactoring. So,
for example, instead of using fs_info, it's root->fs_info in
extent-tree.c
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-28 08:31:28 +02:00
/* data_alloc_cluster is only used in ssd_spread mode */
2009-04-03 09:47:43 -04:00
struct btrfs_free_cluster data_alloc_cluster ;
/* all metadata allocations go through this cluster */
struct btrfs_free_cluster meta_alloc_cluster ;
2008-04-04 15:40:00 -04:00
2011-05-24 15:35:30 -04:00
/* auto defrag inodes go here */
spinlock_t defrag_inodes_lock ;
struct rb_root defrag_inodes ;
atomic_t defrag_running ;
2013-01-29 10:13:12 +00:00
/* Used to protect avail_{data, metadata, system}_alloc_bits */
seqlock_t profiles_lock ;
2012-01-16 22:04:47 +02:00
/*
* these three are in extended format (availability of single
* chunks is denoted by BTRFS_AVAIL_ALLOC_BIT_SINGLE bit, other
* types are denoted by corresponding BTRFS_BLOCK_GROUP_* bits)
*/
2008-04-04 15:40:00 -04:00
u64 avail_data_alloc_bits ;
u64 avail_metadata_alloc_bits ;
u64 avail_system_alloc_bits ;
2008-04-28 15:29:42 -04:00
2012-01-16 22:04:47 +02:00
/* restriper state */
spinlock_t balance_lock ;
struct mutex balance_mutex ;
2012-01-16 22:04:49 +02:00
atomic_t balance_pause_req ;
2012-01-16 22:04:49 +02:00
atomic_t balance_cancel_req ;
2012-01-16 22:04:47 +02:00
struct btrfs_balance_control * balance_ctl ;
2012-01-16 22:04:49 +02:00
wait_queue_head_t balance_wait_q ;
2012-01-16 22:04:47 +02:00
2018-02-26 16:46:05 +08:00
u32 data_chunk_allocations ;
u32 metadata_ratio ;
2009-04-21 17:40:57 -04:00
2008-04-28 15:29:42 -04:00
void * bdev_holder ;
2011-01-06 19:30:25 +08:00
2011-03-08 14:14:00 +01:00
/* private scrub information */
struct mutex scrub_lock ;
atomic_t scrubs_running ;
atomic_t scrub_pause_req ;
atomic_t scrubs_paused ;
atomic_t scrub_cancel_req ;
wait_queue_head_t scrub_pause_wait ;
2019-02-12 16:51:18 +01:00
/*
* The worker pointers are NULL iff the refcount is 0, ie. scrub is not
* running .
*/
2019-01-30 14:45:02 +08:00
refcount_t scrub_workers_refcnt ;
2014-02-28 10:46:19 +08:00
struct btrfs_workqueue * scrub_workers ;
struct btrfs_workqueue * scrub_wr_completion_workers ;
2015-06-04 20:09:15 +08:00
struct btrfs_workqueue * scrub_parity_workers ;
2011-03-08 14:14:00 +01:00
2019-12-13 16:22:14 -08:00
struct btrfs_discard_ctl discard_ctl ;
2011-11-09 13:44:05 +01:00
# ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
u32 check_integrity_print_mask ;
# endif
2011-09-13 12:56:09 +02:00
/* is qgroup tracking in a consistent state? */
u64 qgroup_flags ;
/* holds configuration and tracking. Protected by qgroup_lock */
struct rb_root qgroup_tree ;
spinlock_t qgroup_lock ;
2013-05-06 11:03:27 +00:00
/*
* used to avoid frequently calling ulist_alloc()/ulist_free()
* when doing qgroup accounting, it must be protected by qgroup_lock.
*/
struct ulist * qgroup_ulist ;
btrfs: fix lockdep splat when enabling and disabling qgroups
When running test case btrfs/017 from fstests, lockdep reported the
following splat:
[ 1297.067385] ======================================================
[ 1297.067708] WARNING: possible circular locking dependency detected
[ 1297.068022] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
[ 1297.068322] ------------------------------------------------------
[ 1297.068629] btrfs/189080 is trying to acquire lock:
[ 1297.068929] ffff9f2725731690 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.069274]
but task is already holding lock:
[ 1297.069868] ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
[ 1297.070219]
which lock already depends on the new lock.
[ 1297.071131]
the existing dependency chain (in reverse order) is:
[ 1297.071721]
-> #1 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}:
[ 1297.072375] lock_acquire+0xd8/0x490
[ 1297.072710] __mutex_lock+0xa3/0xb30
[ 1297.073061] btrfs_qgroup_inherit+0x59/0x6a0 [btrfs]
[ 1297.073421] create_subvol+0x194/0x990 [btrfs]
[ 1297.073780] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
[ 1297.074133] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
[ 1297.074498] btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
[ 1297.074872] btrfs_ioctl+0x1a90/0x36f0 [btrfs]
[ 1297.075245] __x64_sys_ioctl+0x83/0xb0
[ 1297.075617] do_syscall_64+0x33/0x80
[ 1297.075993] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1297.076380]
-> #0 (sb_internal#2){.+.+}-{0:0}:
[ 1297.077166] check_prev_add+0x91/0xc60
[ 1297.077572] __lock_acquire+0x1740/0x3110
[ 1297.077984] lock_acquire+0xd8/0x490
[ 1297.078411] start_transaction+0x3c5/0x760 [btrfs]
[ 1297.078853] btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.079323] btrfs_ioctl+0x2c60/0x36f0 [btrfs]
[ 1297.079789] __x64_sys_ioctl+0x83/0xb0
[ 1297.080232] do_syscall_64+0x33/0x80
[ 1297.080680] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1297.081139]
other info that might help us debug this:
[ 1297.082536] Possible unsafe locking scenario:
[ 1297.083510] CPU0 CPU1
[ 1297.084005] ---- ----
[ 1297.084500] lock(&fs_info->qgroup_ioctl_lock);
[ 1297.084994] lock(sb_internal#2);
[ 1297.085485] lock(&fs_info->qgroup_ioctl_lock);
[ 1297.085974] lock(sb_internal#2);
[ 1297.086454]
*** DEADLOCK ***
[ 1297.087880] 3 locks held by btrfs/189080:
[ 1297.088324] #0: ffff9f2725731470 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0xa73/0x36f0 [btrfs]
[ 1297.088799] #1: ffff9f2702b60cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
[ 1297.089284] #2: ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
[ 1297.089771]
stack backtrace:
[ 1297.090662] CPU: 5 PID: 189080 Comm: btrfs Not tainted 5.10.0-rc4-btrfs-next-73 #1
[ 1297.091132] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 1297.092123] Call Trace:
[ 1297.092629] dump_stack+0x8d/0xb5
[ 1297.093115] check_noncircular+0xff/0x110
[ 1297.093596] check_prev_add+0x91/0xc60
[ 1297.094076] ? kvm_clock_read+0x14/0x30
[ 1297.094553] ? kvm_sched_clock_read+0x5/0x10
[ 1297.095029] __lock_acquire+0x1740/0x3110
[ 1297.095510] lock_acquire+0xd8/0x490
[ 1297.095993] ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.096476] start_transaction+0x3c5/0x760 [btrfs]
[ 1297.096962] ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.097451] btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.097941] ? btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
[ 1297.098429] btrfs_ioctl+0x2c60/0x36f0 [btrfs]
[ 1297.098904] ? do_user_addr_fault+0x20c/0x430
[ 1297.099382] ? kvm_clock_read+0x14/0x30
[ 1297.099854] ? kvm_sched_clock_read+0x5/0x10
[ 1297.100328] ? sched_clock+0x5/0x10
[ 1297.100801] ? sched_clock_cpu+0x12/0x180
[ 1297.101272] ? __x64_sys_ioctl+0x83/0xb0
[ 1297.101739] __x64_sys_ioctl+0x83/0xb0
[ 1297.102207] do_syscall_64+0x33/0x80
[ 1297.102673] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1297.103148] RIP: 0033:0x7f773ff65d87
This is because during the quota enable ioctl we first lock the mutex
qgroup_ioctl_lock and then start a transaction, and starting a transaction
acquires a fs freeze semaphore (at the VFS level). However, in every other
code path, except for the quota disable ioctl path, we do the opposite:
we start a transaction and then lock the mutex.
So fix this by making the quota enable and disable paths start the
transaction without having the mutex locked, and then, after starting the
transaction, lock the mutex and check if some other task already enabled
or disabled the quotas, bailing with success if that was the case.
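
The resulting lock order can be sketched in userspace as follows. This is a rough model under stated assumptions: struct fs_model and quota_enable() are hypothetical, and a plain pthread mutex stands in for the freeze semaphore, which in the kernel is taken in read mode by start_transaction() and therefore does not serialize transactions the way this model does.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the pieces of fs_info that matter here. */
struct fs_model {
	pthread_mutex_t freeze_lock;       /* stand-in for the fs freeze semaphore */
	pthread_mutex_t qgroup_ioctl_lock; /* stand-in for the qgroup ioctl mutex */
	bool quota_enabled;
	int enable_runs;                   /* how often the real work was done */
};

static int quota_enable(struct fs_model *fs)
{
	assert(fs != NULL);

	/* 1. "Start the transaction" first, without the mutex held. */
	pthread_mutex_lock(&fs->freeze_lock);

	/* 2. Only then take the mutex, matching every other code path. */
	pthread_mutex_lock(&fs->qgroup_ioctl_lock);

	/* 3. Someone may have enabled quotas while we were not locked. */
	if (!fs->quota_enabled) {
		fs->quota_enabled = true;
		fs->enable_runs++;
	}

	pthread_mutex_unlock(&fs->qgroup_ioctl_lock);
	pthread_mutex_unlock(&fs->freeze_lock);

	return 0; /* success, including the "already enabled" case */
}
```

Because the state is re-checked only after both locks are held, a second caller that lost the race returns success without repeating the work.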
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-23 18:31:02 +00:00
/*
* Protect user change for quota operations. If a transaction is needed,
* it must be started before locking this lock.
*/
2013-04-07 10:50:16 +00:00
struct mutex qgroup_ioctl_lock ;
2011-09-13 12:56:09 +02:00
/* list of dirty qgroups to be written at next commit */
struct list_head dirty_qgroups ;
2015-04-17 10:23:16 +08:00
/* used by qgroup for an efficient tree traversal */
2011-09-13 12:56:09 +02:00
u64 qgroup_seq ;
2011-11-09 13:44:05 +01:00
2013-04-25 16:04:51 +00:00
/* qgroup rescan items */
struct mutex qgroup_rescan_lock ; /* protects the progress item */
struct btrfs_key qgroup_rescan_progress ;
2014-02-28 10:46:19 +08:00
struct btrfs_workqueue * qgroup_rescan_workers ;
2013-05-06 19:14:17 +00:00
struct completion qgroup_rescan_completion ;
Btrfs: fix qgroup rescan resume on mount
When called during mount, we cannot start the rescan worker thread until
open_ctree is done. This commit restructures the qgroup rescan internals to
enable a clean deferral of the rescan resume operation.
First of all, the struct qgroup_rescan is removed, saving us a malloc and
some initialization synchronization problems. Its only element (the worker
struct) now lives within fs_info just as the rest of the rescan code.
Then setting up a rescan worker is split into several reusable stages.
Currently we have three different rescan startup scenarios:
(A) rescan ioctl
(B) rescan resume by mount
(C) rescan by quota enable
Each case needs its own combination of the four following steps:
(1) set the progress [A, C: zero; B: state of umount]
(2) commit the transaction [A]
(3) set the counters [A, C: zero; B: state of umount]
(4) start worker [A, B, C]
qgroup_rescan_init does step (1). There's no extra function added to commit
a transaction, we've got that already. qgroup_rescan_zero_tracking does
step (3). Step (4) is nothing more than a call to the generic
btrfs_queue_worker.
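
The scenario/step matrix above can be written down as a small lookup. The enum and function names below are hypothetical and only mirror the lists in this message; they are not the identifiers used in qgroup.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three startup scenarios. */
enum rescan_scenario {
	RESCAN_IOCTL,        /* (A) rescan ioctl */
	RESCAN_MOUNT_RESUME, /* (B) rescan resume by mount */
	RESCAN_QUOTA_ENABLE, /* (C) rescan by quota enable */
};

/* Hypothetical bit flags for the four steps. */
enum rescan_step {
	STEP_SET_PROGRESS = 1 << 0, /* (1) */
	STEP_COMMIT_TRANS = 1 << 1, /* (2) */
	STEP_SET_COUNTERS = 1 << 2, /* (3) */
	STEP_START_WORKER = 1 << 3, /* (4) */
};

/* Which steps a scenario needs, as a bitmask of enum rescan_step. */
static unsigned int rescan_steps(enum rescan_scenario s)
{
	unsigned int steps = STEP_SET_PROGRESS | STEP_SET_COUNTERS |
			     STEP_START_WORKER;

	assert(s == RESCAN_IOCTL || s == RESCAN_MOUNT_RESUME ||
	       s == RESCAN_QUOTA_ENABLE);

	/* Only the ioctl path commits the transaction. */
	if (s == RESCAN_IOCTL)
		steps |= STEP_COMMIT_TRANS;

	return steps;
}

/* A and C start from zero, B continues from the state saved at umount. */
static bool rescan_starts_from_zero(enum rescan_scenario s)
{
	return s != RESCAN_MOUNT_RESUME;
}
```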
We also get rid of a double check for the rescan progress during
btrfs_qgroup_account_ref, which is no longer required due to having step 2
from the list above.
As a side effect, this commit prepares to move the rescan start code from
btrfs_run_qgroups (which is run during commit) to a less time critical
section.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-28 15:47:24 +00:00
struct btrfs_work qgroup_rescan_work ;
2016-08-15 12:10:33 -04:00
bool qgroup_rescan_running ; /* protected by qgroup_rescan_lock */
2013-04-25 16:04:51 +00:00
2011-01-06 19:30:25 +08:00
/* filesystem state */
2013-01-29 10:14:48 +00:00
unsigned long fs_state ;
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comments.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so this patch implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that uses two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. The
other is used to manage the delayed nodes which are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees, one is used to manage the directory name
index items which are going to be inserted into the b+ tree, and the other is
used to manage the directory name index items which are going to be deleted
from the b+ tree.
- introduce a worker to deal with the delayed operations. This worker handles
the delayed directory name index item insertions and deletions and the
delayed inode updates.
When the number of delayed items is beyond the lower limit, we create works
for some delayed nodes and insert them into the work queue of the worker, and
then go back.
When the number of delayed items is beyond the upper bound, we create works
for all the delayed nodes that haven't been dealt with, insert them into the
work queue of the worker, and then wait until the number of untreated items is
below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search for
it in the inserting rb-tree first. If we find it, we just drop it. If not,
we add its key into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit a transaction, we deal with all the delayed nodes.
- The delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
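
The balance policy described in the list above comes down to a two-threshold decision. The sketch below is an assumption-laden illustration: enum balance_action and delayed_balance() are hypothetical names, the thresholds are parameters because the message does not give their values, and "beyond" is read as "strictly greater than".

```c
#include <assert.h>

/* Hypothetical outcome of one balance check. */
enum balance_action {
	BALANCE_NONE,               /* below the lower limit: nothing to do */
	BALANCE_QUEUE_SOME,         /* queue work for some nodes, then return */
	BALANCE_QUEUE_ALL_AND_WAIT, /* queue work for all nodes, then wait */
};

/*
 * Decide what the caller does after adding a delayed item.
 * nr_items is the current number of delayed items; lower and upper are
 * the two limits mentioned in the description (values not given there).
 */
static enum balance_action delayed_balance(unsigned int nr_items,
					   unsigned int lower,
					   unsigned int upper)
{
	assert(lower <= upper);

	if (nr_items > upper)
		return BALANCE_QUEUE_ALL_AND_WAIT;
	if (nr_items > lower)
		return BALANCE_QUEUE_SOME;
	return BALANCE_NONE;
}
```

Only the upper bound makes the caller wait, so in the common case an insertion or deletion returns as soon as the work has been queued.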
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
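
The percentages quoted above follow from the total times. A minimal check, with improvement_pct() being a hypothetical helper name:

```c
#include <assert.h>

/* Relative reduction of a total time, in percent. */
static double improvement_pct(double before, double after)
{
	assert(before > 0.0);

	return (before - after) / before * 100.0;
}
```

Applied to the totals above, improvement_pct(1.096108, 0.932899) is about 14.9 for file creation and improvement_pct(1.510403, 1.215732) is about 19.5 for file deletion, which matches the ~15% and ~20% figures.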
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
struct btrfs_delayed_root * delayed_root ;
2011-11-03 15:17:42 -04:00
2011-05-23 14:30:00 +02:00
/* readahead tree */
spinlock_t reada_lock ;
struct radix_tree_root reada_tree ;
2011-11-06 03:05:08 -05:00
2016-01-07 18:38:48 +08:00
/* readahead works cnt */
atomic_t reada_works_cnt ;
2013-12-16 13:24:27 -05:00
/* Extent buffer radix tree */
spinlock_t buffer_lock ;
2020-10-21 14:25:05 +08:00
/* Entries are eb->start / sectorsize */
2013-12-16 13:24:27 -05:00
struct radix_tree_root buffer_radix ;
2011-11-03 15:17:42 -04:00
/* next backup root to be overwritten */
int backup_root_index ;
2012-08-01 18:56:49 +02:00
2012-11-05 17:26:40 +01:00
/* device replace state */
struct btrfs_dev_replace dev_replace ;
2012-11-05 17:54:08 +01:00
2013-08-15 17:11:21 +02:00
struct semaphore uuid_tree_rescan_sem ;
Btrfs: reclaim the reserved metadata space at background
Before applying this patch, the task had to reclaim the metadata space
by itself if the metadata space was not enough. And when the task started
the space reclamation, all the other tasks which wanted to reserve the
metadata space were blocked. In some cases, they would be blocked for
a long time, which made the performance fluctuate wildly.
So we introduce the background metadata space reclamation: when the space
is about to be exhausted, we insert a reclaim work into the workqueue, and the
worker of the workqueue helps us to reclaim the reserved space in the
background. This way, the tasks needn't reclaim the space by themselves in
most cases, and even if the tasks have to reclaim the space or are blocked
for the space reclamation, they will get enough space more quickly.
Here is my test result(Tested by compilebench):
Memory: 2GB
CPU: 2Cores * 1CPU
Partition: 40GB(SSD)
Test command:
# compilebench -D <mnt> -m
Without this patch:
intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
With this patch:
intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-05-13 17:29:04 -07:00
/* Used to reclaim the metadata space in the background. */
struct work_struct async_reclaim_work ;
2020-07-21 10:22:33 -04:00
struct work_struct async_data_reclaim_work ;
btrfs: improve preemptive background space flushing
Currently if we ever have to flush space because we do not have enough
we allocate a ticket and attach it to the space_info, and then
systematically flush things in the filesystem that hold space
reservations until our space is reclaimed.
However this has a latency cost, we must go to sleep and wait for the
flushing to make progress before we are woken up and allowed to continue
doing our work.
In order to address that we used to kick off the async worker to flush
space preemptively, so that we could be reclaiming space hopefully
before any tasks needed to stop and wait for space to reclaim.
When I introduced the ticketed ENOSPC stuff this broke slightly in the
fact that we were using tickets to indicate if we were done flushing.
No tickets, no more flushing. However this meant that we essentially
never preemptively flushed. This caused a write performance regression
that Nikolay noticed in an unrelated patch that removed the committing
of the transaction during btrfs_end_transaction.
Before that patch, btrfs_end_transaction() would see that we were low on
space and commit the transaction. This was bad because in this particular
case you could end up with thousands and thousands of transactions being
committed during the 5 minute reproducer. With the patch to remove this
behavior we got much more sane transaction commits, but we ended up slower
because we would write for a while, flush, write for a while, flush again.
To address this we need to reinstate a preemptive flushing mechanism.
However it is distinctly different from our ticketing flushing in that
it doesn't have tickets to base its decisions on. Instead of bolting
this logic into our existing flushing work, add another worker to handle
this preemptive flushing. Here we will attempt to be slightly
intelligent about the things that we flush, attempting to balance
between whichever pool is taking up the most space.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
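Picking the pool that takes up the most space can be sketched as below. This is a hedged illustration, not the worker's actual logic: the three pool names and the enum are assumptions made for the example, and ties are resolved in favour of the first pool checked.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flush targets for the sketch. */
enum flush_target {
	FLUSH_NONE,
	FLUSH_DELALLOC,
	FLUSH_DELAYED_REFS,
	FLUSH_DELAYED_ITEMS,
};

/* Hypothetical per-pool reservation totals. */
struct pool_usage {
	uint64_t delalloc_bytes;
	uint64_t delayed_refs_bytes;
	uint64_t delayed_items_bytes;
};

/* Return the pool holding the most reserved space, or FLUSH_NONE. */
static enum flush_target pick_flush_target(const struct pool_usage *u)
{
	enum flush_target target = FLUSH_NONE;
	uint64_t largest = 0;

	if (u->delalloc_bytes > largest) {
		largest = u->delalloc_bytes;
		target = FLUSH_DELALLOC;
	}
	if (u->delayed_refs_bytes > largest) {
		largest = u->delayed_refs_bytes;
		target = FLUSH_DELAYED_REFS;
	}
	if (u->delayed_items_bytes > largest) {
		largest = u->delayed_items_bytes;
		target = FLUSH_DELAYED_ITEMS;
	}
	return target;
}
```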
2020-10-09 09:28:22 -04:00
struct work_struct preempt_reclaim_work ;
2014-09-18 11:20:02 -04:00
spinlock_t unused_bgs_lock ;
struct list_head unused_bgs ;
2015-01-29 19:18:25 +00:00
struct mutex unused_bg_unpin_mutex ;
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this is:
        CPU 0                                   CPU 1
btrfs_balance()
  btrfs_relocate_chunk()
    btrfs_relocate_block_group(bg X)
      btrfs_lookup_block_group(bg X)
                                        cleaner_kthread
                                          locks fs_info->cleaner_mutex
                                            btrfs_delete_unused_bgs()
                                              finds bg X, which became
                                              unused in the previous
                                              transaction
                                              checks bg X ->ro == 0,
                                              so it proceeds
      sets bg X ->ro to 1
      (btrfs_set_block_group_ro(bg X))
      blocks on fs_info->cleaner_mutex
                                            btrfs_remove_chunk(bg X)
                                          unlocks fs_info->cleaner_mutex
      acquires fs_info->cleaner_mutex
      relocate_block_group()
        --> does nothing, no extents found in
            the extent tree from bg X
      unlocks fs_info->cleaner_mutex
  btrfs_relocate_block_group(bg X) returns
  btrfs_remove_chunk(bg X)
    extent map not found
      --> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
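The effect of serializing the two operations with one mutex can be sketched in userspace with a pthread mutex. This is a simplified model, not the kernel code: the struct and both functions are hypothetical, and the point is only that whichever path runs second sees the block group already gone instead of tripping over a missing extent map.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a block group and its chunk. */
struct block_group_model {
	bool exists; /* false once the chunk has been removed */
	bool ro;     /* read-only flag */
};

static pthread_mutex_t delete_unused_bgs_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Cleaner side: remove the block group if it still exists. */
static bool delete_unused_bg(struct block_group_model *bg)
{
	bool removed = false;

	pthread_mutex_lock(&delete_unused_bgs_mutex);
	if (bg->exists) {
		bg->ro = true;
		bg->exists = false;
		removed = true;
	}
	pthread_mutex_unlock(&delete_unused_bgs_mutex);
	return removed;
}

/*
 * Balance side: relocate and then remove the chunk as one critical
 * section. Returns 0 on success, -1 if the block group is already gone.
 */
static int relocate_and_remove_bg(struct block_group_model *bg)
{
	int ret = -1;

	pthread_mutex_lock(&delete_unused_bgs_mutex);
	if (bg->exists) {
		bg->ro = true;
		/* extents would be relocated here */
		bg->exists = false;
		ret = 0;
	}
	pthread_mutex_unlock(&delete_unused_bgs_mutex);
	return ret;
}
```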
2015-06-11 00:58:53 +01:00
struct mutex delete_unused_bgs_mutex ;
2014-09-23 13:40:08 +08:00
2016-06-15 09:22:56 -04:00
/* Cached block sizes */
u32 nodesize ;
u32 sectorsize ;
2020-07-01 20:45:04 +02:00
/* ilog2 of sectorsize, used to avoid 64-bit division */
u32 sectorsize_bits ;
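Because the sector size is a power of two, dividing a 64-bit byte count by it is the same as shifting right by its log2, which avoids a 64-bit division on 32-bit architectures. A minimal userspace sketch, where ilog2_u32() and bytes_to_sectors() are hypothetical helpers written for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Integer log2 of a power of two; a stand-in for the kernel's ilog2(). */
static uint32_t ilog2_u32(uint32_t v)
{
	uint32_t bits = 0;

	assert(v != 0 && (v & (v - 1)) == 0);
	while (v > 1) {
		v >>= 1;
		bits++;
	}
	return bits;
}

/* Same result as bytes / sectorsize, using a shift instead. */
static uint64_t bytes_to_sectors(uint64_t bytes, uint32_t sectorsize_bits)
{
	return bytes >> sectorsize_bits;
}
```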
2020-07-02 11:10:18 +02:00
u32 csum_size ;
2020-07-02 10:54:11 +02:00
u32 csums_per_leaf ;
2016-06-15 09:22:56 -04:00
u32 stripesize ;
2017-09-29 15:43:50 -04:00
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
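The check that relocation performs against the set of pinned block groups can be sketched as a lookup in a sorted set. Note the swap: the commit uses a red-black tree, while this sketch uses a sorted array with bsearch() purely to stay short. All names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Three-way comparison of two block group start offsets. */
static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/* pinned[] must be sorted ascending and contain at least one entry. */
static bool is_pinned(const uint64_t *pinned, size_t n, uint64_t bg_start)
{
	return bsearch(&bg_start, pinned, n, sizeof(*pinned), cmp_u64) != NULL;
}

/* Balance skips pinned block groups; count the ones it would relocate. */
static size_t count_relocatable(const uint64_t *bgs, size_t nbgs,
				const uint64_t *pinned, size_t npinned)
{
	size_t count = 0;

	for (size_t i = 0; i < nbgs; i++) {
		if (!is_pinned(pinned, npinned, bgs[i]))
			count++;
	}
	return count;
}
```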
2016-11-03 10:28:12 -07:00
/* Block groups and devices containing active swapfiles. */
spinlock_t swapfile_pins_lock ;
struct rb_root swapfile_pins ;
2019-06-03 16:58:56 +02:00
struct crypto_shash * csum_shash ;
Btrfs: prevent send failures and crashes due to concurrent relocation
Send always operates on read-only trees and always expected that while it
is in progress, nothing changes in those trees. Due to that expectation
and the fact that send is a read-only operation, it operates on commit
roots and does not hold transaction handles. However, relocation can COW
nodes and leaves from read-only trees, which can cause unexpected failures
and crashes (hitting BUG_ONs). While send is using a node/leaf, it can get
COWed, the transaction used to COW it is committed, a new transaction
starts, the extent previously used for that node/leaf gets allocated,
possibly for another tree, and the respective extent buffer's content
changes while send is still using it. When this happens, send normally
fails with EIO being returned to user space and messages like the
following are found in dmesg/syslog:
[ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253
[ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984
Other times, less often, we hit a BUG_ON() because an extent buffer that
send is using used to be a node, and while send was still using it, it
got COWed and reused as a leaf, producing the following trace:
[ 3478.466280] ------------[ cut here ]------------
[ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806!
[ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1
[ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246
[ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002
[ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000
[ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e
[ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708
[ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 3478.475840] FS: 00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000
[ 3478.476489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0
[ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3478.479003] Call Trace:
[ 3478.479600] ? do_raw_spin_unlock+0x49/0xc0
[ 3478.480202] tree_advance+0x173/0x1d0 [btrfs]
[ 3478.480810] btrfs_compare_trees+0x30c/0x690 [btrfs]
[ 3478.481388] ? process_extent+0x1280/0x1280 [btrfs]
[ 3478.481954] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[ 3478.482510] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[ 3478.483062] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[ 3478.483581] ? rq_clock_task+0x2e/0x60
[ 3478.484086] ? wake_up_new_task+0x1f3/0x370
[ 3478.484582] ? do_vfs_ioctl+0xa2/0x6f0
[ 3478.485075] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 3478.485552] do_vfs_ioctl+0xa2/0x6f0
[ 3478.486016] ? __fget+0x113/0x200
[ 3478.486467] ksys_ioctl+0x70/0x80
[ 3478.486911] __x64_sys_ioctl+0x16/0x20
[ 3478.487337] do_syscall_64+0x60/0x1b0
[ 3478.487751] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7
(...)
[ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7
[ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005
[ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700
[ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005
[ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001
(...)
[ 3478.493352] ---[ end trace d5f537302be4f8c8 ]---
Another possibility, much less likely to happen, is that send will not
fail but the contents of the stream it produces may not be correct.
To avoid this, do not allow send and relocation (balance) to run in
parallel. In the long term the goal is to allow for both to be able to
run concurrently without any problems, but that will take a significant
effort in development and testing.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
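The mutual exclusion between send and balance can be sketched with a counter and a flag. This is a single-threaded model under assumptions: in the kernel the counter is updated while holding balance_mutex, which the sketch leaves out, and the function names and error codes here are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the two fields involved. */
struct fs_model {
	int send_in_progress; /* number of running send operations */
	bool balance_running;
};

/* Start a send; refused while balance is running. */
static int send_begin(struct fs_model *fs)
{
	if (fs->balance_running)
		return -EAGAIN;
	fs->send_in_progress++;
	return 0;
}

static void send_end(struct fs_model *fs)
{
	assert(fs->send_in_progress > 0);
	fs->send_in_progress--;
}

/* Start balance; refused while any send is in progress. */
static int balance_begin(struct fs_model *fs)
{
	if (fs->send_in_progress > 0)
		return -EAGAIN;
	if (fs->balance_running)
		return -EBUSY;
	fs->balance_running = true;
	return 0;
}

static void balance_end(struct fs_model *fs)
{
	fs->balance_running = false;
}
```

Several sends may overlap because the field is a counter, while balance needs the counter to be zero.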
2019-04-22 16:44:09 +01:00
/*
 * Number of send operations in progress.
 * Updated while holding fs_info::balance_mutex.
 */
int send_in_progress ;
2020-08-25 10:02:32 -05:00
/* Type of exclusive operation running */
unsigned long exclusive_operation ;
2020-11-10 20:26:08 +09:00
/*
 * Zone size > 0 when in ZONED mode, otherwise it's used for a check
 * if the mode is enabled
 */
union {
u64 zone_size ;
u64 zoned ;
} ;
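The union gives one 64-bit value two names: zone_size when the size is needed and zoned when only the mode is being tested. A minimal userspace sketch, where the struct and the is_zoned() helper are hypothetical stand-ins:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in holding only the union from the struct. */
struct zoned_info_model {
	union {
		uint64_t zone_size; /* > 0 in zoned mode */
		uint64_t zoned;     /* same storage, read as a flag */
	};
};

/* Zoned mode is enabled exactly when the zone size is non-zero. */
static bool is_zoned(const struct zoned_info_model *info)
{
	return info->zoned != 0;
}
```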
2020-11-10 20:26:09 +09:00
/* Max size to emit ZONE_APPEND write command */
u64 max_zone_append_size ;
2021-02-04 19:22:08 +09:00
struct mutex zoned_meta_io_lock ;
2021-02-04 19:22:18 +09:00
spinlock_t treelog_bg_lock ;
u64 treelog_bg ;
2020-11-10 20:26:09 +09:00
2017-09-29 15:43:50 -04:00
# ifdef CONFIG_BTRFS_FS_REF_VERIFY
spinlock_t ref_verify_lock ;
struct rb_root block_tree ;
# endif
2019-12-13 16:22:18 -08:00
# ifdef CONFIG_BTRFS_DEBUG
struct kobject * debug_kobj ;
2019-12-13 16:22:19 -08:00
struct kobject * discard_debug_kobj ;
2020-01-24 09:33:00 -05:00
struct list_head allocated_roots ;
2020-02-14 16:11:40 -05:00
spinlock_t eb_leak_lock ;
struct list_head allocated_ebs ;
2019-12-13 16:22:18 -08:00
# endif
2007-11-16 14:57:08 -05:00
} ;
2008-03-24 15:01:56 -04:00
2016-06-15 09:22:56 -04:00
static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
{
	return sb->s_fs_info;
}
2014-04-02 19:51:05 +08:00
/*
* The state of btrfs root
*/
2018-11-27 14:57:19 +01:00
enum {
/*
 * btrfs_record_root_in_trans is a multi-step process, and it can race
 * with the balancing code. But the race is very small, and only the
 * first time the root is added to each transaction. So IN_TRANS_SETUP
 * is used to tell us when more checks are required
 */
BTRFS_ROOT_IN_TRANS_SETUP ,
2020-05-15 14:01:40 +08:00
/*
 * Set if tree blocks of this root can be shared by other roots.
 * Only subvolume trees and their reloc trees have this bit set.
 * Conflicts with TRACK_DIRTY bit.
 *
 * This affects two things:
 *
 * - How balance works
 *   For shareable roots, we need to use reloc tree and do path
 *   replacement for balance, and need various pre/post hooks for
 *   snapshot creation to handle them.
 *
 *   While for non-shareable trees, we just simply do a tree search
 *   with COW.
 *
 * - How dirty roots are tracked
 *   For shareable roots, btrfs_record_root_in_trans() is needed to
 *   track them, while non-subvolume roots have TRACK_DIRTY bit, they
 *   don't need to set this manually.
 */
BTRFS_ROOT_SHAREABLE ,
2018-11-27 14:57:19 +01:00
BTRFS_ROOT_TRACK_DIRTY ,
BTRFS_ROOT_IN_RADIX ,
BTRFS_ROOT_ORPHAN_ITEM_INSERTED ,
BTRFS_ROOT_DEFRAG_RUNNING ,
BTRFS_ROOT_FORCE_COW ,
BTRFS_ROOT_MULTI_LOG_TASKS ,
BTRFS_ROOT_DIRTY ,
2018-11-30 11:52:13 -05:00
BTRFS_ROOT_DELETING ,
2019-01-23 15:15:14 +08:00
/*
 * Reloc tree is orphan, only kept here for qgroup delayed subtree scan
 *
 * Set for the subvolume tree owning the reloc tree.
 */
BTRFS_ROOT_DEAD_RELOC_TREE ,
2019-02-06 15:46:14 -05:00
/* Mark dead root stored on device whose cleanup needs to be resumed */
BTRFS_ROOT_DEAD_TREE ,
btrfs: do not block inode logging for so long during transaction commit
Early on during a transaction commit we acquire the tree_log_mutex and
hold it until after we write the super blocks. But before writing the
extent buffers dirtied by the transaction and the super blocks we unblock
the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting
fs_info->running_transaction to NULL.
This means that after that and before writing the super blocks, new
transactions can start. However if any transaction wants to log an inode,
it will block waiting for the transaction commit to write its dirty
extent buffers and the super blocks because the tree_log_mutex is only
released after those operations are complete, and starting a new log
transaction blocks on that mutex (at start_log_trans()).
Writing the dirty extent buffers and the super blocks can take a very
significant amount of time to complete, but we could allow the tasks
wanting to log an inode to proceed with most of their steps:
1) create the log trees
2) log metadata in the trees
3) write their dirty extent buffers
They only need to wait for the previous transaction commit to complete
(write its super blocks) before they attempt to write their super blocks,
otherwise we could end up with a corrupt filesystem after a crash.
So change start_log_trans() to use the root tree's log_mutex to serialize
for the creation of the log root tree instead of using the tree_log_mutex,
and make btrfs_sync_log() acquire the tree_log_mutex before writing the
super blocks. This allows for inode logging to wait much less time when
there is a previous transaction that is still committing, often not having
to wait at all, as by the time when we try to sync the log the previous
transaction already wrote its super blocks.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
The following script that uses dbench was used to measure the impact of
the whole patchset:
$ cat test-dbench.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/btrfs
MOUNT_OPTIONS="-o ssd"
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f -m single -d single $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 300 64
umount $MNT
The test was run on a machine with 12 cores, 64G of ram, using a NVMe
device and a non-debug kernel configuration (Debian's default).
Before patch set:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 11277211 0.250 85.340
Close 8283172 0.002 6.479
Rename 477515 1.935 86.026
Unlink 2277936 0.770 87.071
Deltree 256 15.732 81.379
Mkdir 128 0.003 0.009
Qpathinfo 10221180 0.056 44.404
Qfileinfo 1789967 0.002 4.066
Qfsinfo 1874399 0.003 9.176
Sfileinfo 918589 0.061 10.247
Find 3951758 0.341 54.040
WriteX 5616547 0.047 85.079
ReadX 17676028 0.005 9.704
LockX 36704 0.003 1.800
UnlockX 36704 0.002 0.687
Flush 790541 14.115 676.236
Throughput 1179.19 MB/sec 64 clients 64 procs max_latency=676.240 ms
After patch set:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 12687926 0.171 86.526
Close 9320780 0.002 8.063
Rename 537253 1.444 78.576
Unlink 2561827 0.559 87.228
Deltree 374 11.499 73.549
Mkdir 187 0.003 0.005
Qpathinfo 11500300 0.061 36.801
Qfileinfo 2017118 0.002 7.189
Qfsinfo 2108641 0.003 4.825
Sfileinfo 1033574 0.008 8.065
Find 4446553 0.408 47.835
WriteX 6335667 0.045 84.388
ReadX 19887312 0.003 9.215
LockX 41312 0.003 1.394
UnlockX 41312 0.002 1.425
Flush 889233 13.014 623.259
Throughput 1339.32 MB/sec 64 clients 64 procs max_latency=623.265 ms
+12.7% throughput, -8.2% max latency
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-25 12:19:28 +00:00
/* The root has a log tree. Used for subvolume roots and the tree root. */
2020-06-15 10:38:44 +01:00
BTRFS_ROOT_HAS_LOG_TREE ,
btrfs: qgroup: try to flush qgroup space when we get -EDQUOT
[PROBLEM]
There are known problems related to how btrfs handles qgroup reserved
space. One of the most obvious cases is the test case btrfs/153,
which does fallocate, then writes into the preallocated range.
btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
--- tests/btrfs/153.out 2019-10-22 15:18:14.068965341 +0800
+++ xfstests-dev/results//btrfs/153.out.bad 2020-07-01 20:24:40.730000089 +0800
@@ -1,2 +1,5 @@
QA output created by 153
+pwrite: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
Silence is golden
...
(Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad' to see the entire diff)
[CAUSE]
Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
we always reserve space no matter if it's COW or not.
Such behavior change is mostly for performance, and reverting it is not
a good idea anyway.
For a preallocated extent, we reserve qgroup data space for it already,
and since we also reserve data space for qgroup at buffered write time,
it needs twice the space for us to write into preallocated space.
This leads to the -EDQUOT in the buffered write routine.
And we can't follow the same solution: unlike the data/meta space check,
qgroup reserved space is shared between data and metadata.
The EDQUOT can happen at the metadata reservation, so doing the NODATACOW
check after qgroup reservation failure is not a solution.
[FIX]
To solve the problem, we don't return -EDQUOT directly, but every time
we get an -EDQUOT, we try to flush qgroup space:
- Flush all inodes of the root
NODATACOW writes will free the qgroup reserved at run_dealloc_range().
However we don't have the infrastructure to only flush NODATACOW
inodes, here we flush all inodes anyway.
- Wait for ordered extents
This would convert the preallocated metadata space into per-trans
metadata, which can be freed in later transaction commit.
- Commit transaction
This will free all per-trans metadata space.
Also we don't want to trigger flush multiple times, so here we introduce
a per-root wait list and a new root status, to ensure only one thread
starts the flushing.
Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
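The flush-and-retry behaviour can be sketched as below. This is a simplified single-threaded model: the struct, its fields and the three functions are hypothetical, the three flush steps are collapsed into one function, and the per-root wait list that keeps a second thread from starting another flush is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of one qgroup's accounting. */
struct qgroup_model {
	uint64_t limit;       /* quota limit in bytes */
	uint64_t used;        /* committed usage */
	uint64_t reserved;    /* reservations not yet committed */
	uint64_t reclaimable; /* part of reserved that a flush releases */
	unsigned flush_count; /* how many flushes were triggered */
};

/* Plain reservation attempt, no flushing. */
static int try_reserve(struct qgroup_model *q, uint64_t bytes)
{
	if (q->used + q->reserved + bytes > q->limit)
		return -EDQUOT;
	q->reserved += bytes;
	return 0;
}

/* Models: flush inodes, wait for ordered extents, commit transaction. */
static void flush_qgroup_space(struct qgroup_model *q)
{
	assert(q->reclaimable <= q->reserved);
	q->reserved -= q->reclaimable;
	q->reclaimable = 0;
	q->flush_count++;
}

/* On -EDQUOT, flush once and retry before reporting the error. */
static int reserve_with_flush(struct qgroup_model *q, uint64_t bytes)
{
	int ret = try_reserve(q, bytes);

	if (ret != -EDQUOT)
		return ret;
	flush_qgroup_space(q);
	return try_reserve(q, bytes);
}
```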
2020-07-13 18:50:48 +08:00
/* Qgroup flushing is in progress */
BTRFS_ROOT_QGROUP_FLUSHING ,
2018-11-27 14:57:19 +01:00
} ;
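Each enumerator above is a bit number within the root's unsigned long state word. A userspace sketch of that usage follows, with two caveats: the helpers are plain non-atomic stand-ins written for this example, whereas kernel code would use atomic bit operations, and the MODEL_ROOT_* enum is a hypothetical subset.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical subset of the state bits, used as bit numbers. */
enum {
	MODEL_ROOT_IN_TRANS_SETUP,
	MODEL_ROOT_SHAREABLE,
	MODEL_ROOT_TRACK_DIRTY,
};

/* Non-atomic test of bit nr in *state. */
static bool test_state_bit(int nr, const unsigned long *state)
{
	assert(nr >= 0 && nr < (int)(sizeof(unsigned long) * CHAR_BIT));
	return (*state >> nr) & 1UL;
}

/* Set bit nr and return its previous value. */
static bool test_and_set_state_bit(int nr, unsigned long *state)
{
	bool old = test_state_bit(nr, state);

	*state |= 1UL << nr;
	return old;
}

/* Clear bit nr. */
static void clear_state_bit(int nr, unsigned long *state)
{
	assert(nr >= 0 && nr < (int)(sizeof(unsigned long) * CHAR_BIT));
	*state &= ~(1UL << nr);
}
```

A test-and-set of this kind is how a "only one thread starts the work" flag such as a flushing bit is typically claimed, provided the operation is atomic.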
2014-04-02 19:51:05 +08:00
btrfs: qgroup: Introduce per-root swapped blocks infrastructure
To allow delayed subtree swap rescan, btrfs needs to record per-root
information about which tree blocks get swapped. This patch introduces
the required infrastructure.
The designed workflow will be:
1) Record the subtree root block that gets swapped.
During subtree swap:
O = Old tree blocks
N = New tree blocks
     reloc tree                  subvolume tree X
        Root                           Root
       /    \                         /    \
     NA      OB                     OA      OB
    /  |     |  \                  /  |     |  \
  NC   ND    OE  OF              OC   OD    OE  OF
In this case, NA and OA are going to be swapped, record (NA, OA) into
subvolume tree X.
2) After subtree swap.
     reloc tree                  subvolume tree X
        Root                           Root
       /    \                         /    \
     OA      OB                     NA      OB
    /  |     |  \                  /  |     |  \
  OC   OD    OE  OF              NC   ND    OE  OF
3a) COW happens for OB
If we are going to COW tree block OB, we check OB's bytenr against
tree X's swapped_blocks structure.
If it doesn't fit any, nothing will happen.
3b) COW happens for NA
Check NA's bytenr against tree X's swapped_blocks, and get a hit.
Then we do subtree scan on both subtrees OA and NA.
Resulting in 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
Then no matter what we do to subvolume tree X, qgroup numbers will
still be correct.
Then NA's record gets removed from X's swapped_blocks.
4) Transaction commit
Any record in X's swapped_blocks gets removed, since there is no
modification to swapped subtrees, no need to trigger heavy qgroup
subtree rescan for them.
This will introduce 128 bytes of overhead for each btrfs_root even if
qgroup is not enabled. This is to reduce memory allocations and potential
failures.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
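The record, check and clear steps of the workflow can be sketched with a small fixed array. This is an assumption-laden simplification: the patch keeps one red-black tree per tree level, while the sketch uses a single unsorted array, and every name and the capacity are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SWAPPED 16 /* hypothetical capacity for the sketch */

/* One swap: block now in the subvolume tree, block now in the reloc tree. */
struct swapped_pair {
	uint64_t subvol_bytenr;
	uint64_t reloc_bytenr;
};

struct swapped_blocks_model {
	struct swapped_pair pairs[MAX_SWAPPED];
	size_t count;
};

/* Step 1: record a swapped subtree root pair. */
static bool record_swap(struct swapped_blocks_model *sb,
			uint64_t subvol_bytenr, uint64_t reloc_bytenr)
{
	if (sb->count == MAX_SWAPPED)
		return false;
	sb->pairs[sb->count].subvol_bytenr = subvol_bytenr;
	sb->pairs[sb->count].reloc_bytenr = reloc_bytenr;
	sb->count++;
	return true;
}

/*
 * Step 3: on COW of a subvolume tree block, look up its bytenr. On a hit
 * the caller must scan both subtrees, and the record is dropped.
 */
static bool cow_check_swapped(struct swapped_blocks_model *sb,
			      uint64_t bytenr, uint64_t *reloc_bytenr)
{
	for (size_t i = 0; i < sb->count; i++) {
		if (sb->pairs[i].subvol_bytenr == bytenr) {
			*reloc_bytenr = sb->pairs[i].reloc_bytenr;
			sb->pairs[i] = sb->pairs[sb->count - 1];
			sb->count--;
			return true;
		}
	}
	return false;
}

/* Step 4: transaction commit drops every remaining record. */
static void commit_clear_swapped(struct swapped_blocks_model *sb)
{
	sb->count = 0;
}
```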
2019-01-23 15:15:16 +08:00
/*
 * Record swapped tree blocks of a subvolume tree for delayed subtree trace
 * code. For detail check comment in fs/btrfs/qgroup.c.
 */
struct btrfs_qgroup_swapped_blocks {
spinlock_t lock ;
/* RM_EMPTY_ROOT() of above blocks[] */
bool swapped ;
struct rb_root blocks [ BTRFS_MAX_LEVEL ] ;
} ;
2007-03-20 14:38:32 -04:00
/*
 * In-RAM representation of the tree. extent_root is used for all allocations
2007-04-25 15:52:25 -04:00
 * and for the extent tree extent_root root.
2007-03-20 14:38:32 -04:00
*/
struct btrfs_root {
2007-10-15 16:14:19 -04:00
struct extent_buffer * node ;
2008-06-25 16:01:30 -04:00
2007-10-15 16:14:19 -04:00
struct extent_buffer * commit_root ;
2008-09-05 16:13:11 -04:00
struct btrfs_root * log_root ;
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keep the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a separate inode to hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share the
same root key objectid, doing special handling for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 10:09:34 -04:00
struct btrfs_root * reloc_root ;
2008-07-28 15:32:19 -04:00
2014-04-02 19:51:05 +08:00
unsigned long state ;
2007-03-15 12:56:47 -04:00
struct btrfs_root_item root_item ;
struct btrfs_key root_key ;
2007-03-20 14:38:32 -04:00
struct btrfs_fs_info * fs_info ;
2008-09-11 16:17:57 -04:00
struct extent_io_tree dirty_log_pages ;
2008-06-25 16:01:30 -04:00
struct mutex objectid_mutex ;
2009-01-21 12:54:03 -05:00
2010-05-16 10:46:25 -04:00
spinlock_t accounting_lock ;
struct btrfs_block_rsv * block_rsv ;
2008-09-05 16:13:11 -04:00
struct mutex log_mutex ;
2009-01-21 12:54:03 -05:00
wait_queue_head_t log_writer_wait ;
wait_queue_head_t log_commit_wait [ 2 ] ;
2014-02-20 18:08:58 +08:00
struct list_head log_ctxs [ 2 ] ;
btrfs: remove no longer needed use of log_writers for the log root tree
When syncing the log, we used to update the log root tree without holding
either the log_mutex of the subvolume root or the log_mutex of the log root
tree.
We used to have two critical sections delimited by the log_mutex of the
log root tree, so in the first one we incremented the log_writers of the
log root tree and on the second one we decremented it and waited for the
log_writers counter to go down to zero. This was because the update of
the log root tree happened between the two critical sections.
The use of two critical sections allowed a little bit more of parallelism
and required the use of the log_writers counter, necessary to make sure
we didn't miss any log root tree update when we have multiple tasks trying
to sync the log in parallel.
However after commit 06989c799f0481 ("Btrfs: fix race updating log root
item during fsync") the log root tree update was moved into a critical
section delimited by the subvolume's log_mutex. Later another commit
moved the log tree update from that critical section into the second
critical section delimited by the log_mutex of the log root tree. Both
commits addressed different bugs.
The end result is that the first critical section delimited by the
log_mutex of the log root tree became pointless, since there's nothing
done between it and the second critical section, we just have an unlock
of the log_mutex followed by a lock operation. This means we can merge
both critical sections, as the first one does almost nothing now, and we
can stop using the log_writers counter of the log root tree, which was
incremented in the first critical section and decremented in the second
critical section, used to make sure no one in the second critical section
started writeback of the log root tree before some other task updated it.
So just remove the mutex_unlock() followed by mutex_lock() of the log root
tree, as well as the use of the log_writers counter for the log root tree.
This patch is part of a series that has the following patches:
1/4 btrfs: only commit the delayed inode when doing a full fsync
2/4 btrfs: only commit delayed items at fsync if we are logging a directory
3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
4/4 btrfs: remove no longer needed use of log_writers for the log root tree
After the entire patchset applied I saw about 12% decrease on max latency
reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
ram, using kvm and using a raw NVMe device directly (no intermediary fs on
the host). The test was invoked like the following:
mkfs.btrfs -f /dev/sdk
mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
dbench -D /mnt/sdk -t 300 8
umount /mnt/sdk
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-02 12:32:40 +01:00
/* Used only for log trees of subvolumes, not for the log root tree */
2009-01-21 12:54:03 -05:00
atomic_t log_writers ;
atomic_t log_commit [ 2 ] ;
2020-07-02 12:32:31 +01:00
/* Used only for log trees of subvolumes, not for the log root tree */
2012-09-06 04:04:27 -06:00
atomic_t log_batch ;
2014-02-20 18:08:56 +08:00
int log_transid ;
2014-02-20 18:08:59 +08:00
/* Updated whether or not the commit succeeds. */
int log_transid_committed ;
/* Updated only when the commit succeeds. */
2014-02-20 18:08:56 +08:00
int last_log_commit ;
2009-10-08 15:30:04 -04:00
pid_t log_start_pid ;
2008-08-04 23:17:27 -04:00
2007-04-09 10:42:37 -04:00
u64 last_trans ;
2007-10-15 16:14:19 -04:00
2007-03-20 14:38:32 -04:00
u32 type ;
2009-09-21 15:56:00 -04:00
2020-12-07 17:32:35 +02:00
u64 free_objectid ;
2011-06-13 20:00:16 -04:00
2007-08-07 16:15:09 -04:00
struct btrfs_key defrag_progress ;
2008-05-24 14:04:53 -04:00
struct btrfs_key defrag_max ;
2008-03-24 15:01:56 -04:00
2020-05-15 14:01:40 +08:00
/* The dirty list is only used by non-shareable roots */
2008-03-24 15:01:56 -04:00
struct list_head dirty_list ;
2008-07-24 12:17:14 -04:00
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
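The two COW cases described above can be modelled with a toy in-memory block. This is a minimal sketch under stated assumptions, not the kernel implementation: `struct toy_block`, `toy_cow()` and the fixed fan-out `TOY_FANOUT` are hypothetical names invented for illustration, and real extents are tracked on disk rather than through `malloc()`.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_FANOUT 2	/* hypothetical fan-out, for illustration only */

/* Hypothetical stand-in for a tree block: a refcount plus child pointers. */
struct toy_block {
	int refs;
	struct toy_block *child[TOY_FANOUT];
};

/*
 * COW one block following the rule in the text:
 *  - refs == 1: free the old block at once; the copy takes over its
 *    pointers, so the children's reference counts are left untouched.
 *  - refs > 1: every block the copy points to gains one reference and
 *    the old block loses one.
 */
static struct toy_block *toy_cow(struct toy_block *old)
{
	struct toy_block *new;
	int i;

	assert(old && old->refs >= 1);
	new = malloc(sizeof(*new));
	if (!new)
		return NULL;
	new->refs = 1;
	for (i = 0; i < TOY_FANOUT; i++)
		new->child[i] = old->child[i];

	if (old->refs == 1) {
		free(old);
	} else {
		for (i = 0; i < TOY_FANOUT; i++)
			if (new->child[i])
				new->child[i]->refs++;
		old->refs--;
	}
	return new;
}
```

In this model, COWing a non-shared block never touches lower levels, which is the saving the commit message describes.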
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
struct list_head root_list ;
2012-10-12 15:27:49 -04:00
spinlock_t log_extents_lock [ 2 ] ;
struct list_head logged_list [ 2 ] ;
2010-05-16 10:49:58 -04:00
int orphan_cleanup_state ;
2008-11-17 20:42:26 -05:00
2009-06-10 10:45:14 -04:00
spinlock_t inode_lock ;
/* red-black tree that keeps track of in-memory inodes */
struct rb_root inode_tree ;
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comments.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so we made this patch, which implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, which uses two lists to
manage the delayed nodes that are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. The
other is used to manage the delayed nodes that are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees: one is used to manage the directory name
index items that are going to be inserted into the b+ tree, and the other is
used to manage the directory name index items that are going to be deleted
from the b+ tree.
- introduce a worker to deal with the delayed operations. This worker
handles the delayed directory name index item insertions and deletions
and the delayed inode updates.
When the number of delayed items is beyond the lower limit, we create works
for some delayed nodes, insert them into the work queue of the worker, and
then go back.
When the number of delayed items is beyond the upper bound, we create works
for all the delayed nodes that haven't been dealt with, insert them into the
work queue of the worker, and then wait until the number of untreated items
is below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add the
information into the delayed inserting rb-tree.
Then we check the number of delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search for it
in the inserting rb-tree first. If we find it, we just drop it. If not,
we add its key into the delayed deleting rb-tree.
As with the delayed inserting rb-tree, we also check the number of
delayed items and do delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit a transaction, we deal with all the delayed nodes.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
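The balance policy in the list above (nothing below the lower limit, asynchronous work above it, synchronous waiting above the upper bound) might be sketched as a small decision function. This is only an illustration: `toy_balance_policy()`, the enum names and the threshold parameters are hypothetical, and reading "beyond" as a strict comparison is an assumption.

```c
#include <assert.h>

/* Hypothetical names; the real code uses its own constants and helpers. */
enum toy_balance_action {
	TOY_BALANCE_NONE,	/* below the lower limit: nothing to do */
	TOY_BALANCE_ASYNC,	/* queue work for some nodes and return */
	TOY_BALANCE_SYNC_WAIT,	/* queue work for all nodes and wait */
};

/* Decide what the caller should do for the current number of delayed items. */
static enum toy_balance_action toy_balance_policy(unsigned long nr_items,
						  unsigned long lower,
						  unsigned long upper)
{
	assert(lower <= upper);
	if (nr_items > upper)
		return TOY_BALANCE_SYNC_WAIT;
	if (nr_items > lower)
		return TOY_BALANCE_ASYNC;
	return TOY_BALANCE_NONE;
}
```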
I did a quick test with the benchmark tool [1] and found that we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
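The quoted percentages can be re-derived from the total times listed above. The helper below is a hypothetical one-liner written for this check and is not part of the benchmark tool.

```c
#include <assert.h>

/* Relative improvement in percent: (before - after) / before * 100. */
static double toy_improvement_pct(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

Applied to the totals above, `toy_improvement_pct(1.096108, 0.932899)` gives roughly 14.9 for file creation and `toy_improvement_pct(1.510403, 1.215732)` gives roughly 19.5 for file deletion, consistent with the ~15% and ~20% figures.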
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
/*
* radix tree that keeps track of delayed nodes of every inode ,
* protected by inode_lock
*/
struct radix_tree_root delayed_nodes_tree ;
2008-11-17 20:42:26 -05:00
/*
* right now this just gets used so that a root has its own devid
* for stat . It may be used for more later
*/
2011-07-07 15:44:25 -04:00
dev_t anon_dev ;
2011-11-14 20:48:06 -05:00
2012-12-07 09:28:54 +00:00
spinlock_t root_item_lock ;
2017-03-03 10:55:18 +02:00
refcount_t refs ;
2013-05-15 07:48:22 +00:00
2014-03-06 13:55:03 +08:00
struct mutex delalloc_mutex ;
2013-05-15 07:48:22 +00:00
spinlock_t delalloc_lock ;
/*
* all of the inodes that have delalloc bytes . It is possible for
* this list to be empty even when there is still dirty data = ordered
* extents waiting to finish IO .
*/
struct list_head delalloc_inodes ;
struct list_head delalloc_root ;
u64 nr_delalloc_inodes ;
2014-03-06 13:55:02 +08:00
struct mutex ordered_extent_mutex ;
2013-05-15 07:48:23 +00:00
/*
* this is used by the balancing code to wait for all the pending
* ordered extents
*/
spinlock_t ordered_extent_lock ;
/*
* all of the data = ordered extents pending writeback
* these can span multiple transactions and basically include
* every dirty data page that isn ' t from nodatacow
*/
struct list_head ordered_extents ;
struct list_head ordered_root ;
u64 nr_ordered_extents ;
2013-12-16 17:34:17 +01:00
2019-01-23 15:15:14 +08:00
/*
* Not empty if this subvolume root has gone through tree block swap
* ( relocation )
*
* Will be used by reloc_control : : dirty_subvol_roots .
*/
struct list_head reloc_dirty_list ;
2013-12-16 17:34:17 +01:00
/*
* Number of currently running SEND ioctls to prevent
* manipulation with the read - only status via SUBVOL_SETFLAGS
*/
int send_in_progress ;
Btrfs: fix race between send and deduplication that lead to failures and crashes
Send operates on read only trees and expects them to never change while it
is using them. This is part of its initial design, and this expectation is
due to two different reasons:
1) When it was introduced, no operations were allowed to modify read-only
subvolumes/snapshots (including defrag for example).
2) It keeps send from having an impact on other filesystem operations.
Namely send does not need to keep locks on the trees nor needs to hold on
to transaction handles and delay transaction commits. This ends up being
a consequence of the former reason.
However the deduplication feature was introduced later (in September 2013,
while send was introduced in July 2012) and it allowed for deduplication
with destination files that belong to read-only trees (subvolumes and
snapshots).
That means that having a send operation (either full or incremental) running
in parallel with a deduplication that has the destination inode in one of
the trees used by the send operation, can result in tree nodes and leaves
getting freed and reused while send is using them. This problem is similar
to the problem solved for the root nodes getting freed and reused when a
snapshot is made against one tree that is currently being used by a send
operation, fixed in commits [1] and [2]. These commits explain in detail
how the problem happens and the explanation is valid for any node or leaf
that is not the root of a tree as well. This problem was also discussed
and explained recently in a thread [3].
The problem is very easy to reproduce when using send with large trees
(snapshots) and just a few concurrent deduplication operations that target
files in the trees used by send. A stress test case is being sent for
fstests that triggers the issue easily. The most common error to hit is
the send ioctl returning -EIO with the following messages in dmesg/syslog:
[1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
[1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27
The first one is very easy to hit while the second one happens much less
frequently, except for very large trees (in that test case, snapshots
with 100000 files having large xattrs to get deep and wide trees).
Less frequently, at least one BUG_ON can be hit:
[1631742.130080] ------------[ cut here ]------------
[1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
[1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
[1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1
[1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
[1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
[1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
[1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
[1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
[1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
[1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
[1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1631742.141272] Call Trace:
[1631742.141826] ? do_raw_spin_unlock+0x49/0xc0
[1631742.142390] tree_advance+0x173/0x1d0 [btrfs]
[1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs]
[1631742.143533] ? process_extent+0x1070/0x1070 [btrfs]
[1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0
[1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[1631742.146179] ? account_entity_enqueue+0xd3/0x100
[1631742.146662] ? reweight_entity+0x154/0x1a0
[1631742.147135] ? update_curr+0x20/0x2a0
[1631742.147593] ? check_preempt_wakeup+0x103/0x250
[1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0
[1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[1631742.148942] do_vfs_ioctl+0xa2/0x6f0
[1631742.149361] ? __fget+0x113/0x200
[1631742.149767] ksys_ioctl+0x70/0x80
[1631742.150159] __x64_sys_ioctl+0x16/0x20
[1631742.150543] do_syscall_64+0x60/0x1b0
[1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[1631742.151326] RIP: 0033:0x7f4cd9f5add7
(...)
[1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
[1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
[1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
[1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
[1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
(...)
[1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---
That BUG_ON happens because while send is using a node, that node is COWed
by a concurrent deduplication, gets freed and gets reused as a leaf (because
a transaction commit happened in between), so when it attempts to read a
slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
contents were wiped out and it now matches a leaf (which can even belong to
some other tree now), hitting the BUG_ON(level == 0).
Fix this concurrency issue by not allowing send and deduplication to run
in parallel if both operate on the same readonly trees, returning EAGAIN
to user space and logging an explicit warning in dmesg/syslog.
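The mutual exclusion described in the fix can be sketched with the two per-root counters. This is a simplified single-threaded model: `struct toy_root` and the `toy_*` helpers are hypothetical, and the locking is omitted here, whereas the struct comment says `dedupe_in_progress` is protected by `root_item_lock`.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical reduced root holding only the two counters from the text. */
struct toy_root {
	int send_in_progress;
	int dedupe_in_progress;
};

/* Start a send: refuse with -EAGAIN while a dedupe targets this root. */
static int toy_send_begin(struct toy_root *root)
{
	if (root->dedupe_in_progress)
		return -EAGAIN;
	root->send_in_progress++;
	return 0;
}

static void toy_send_end(struct toy_root *root)
{
	assert(root->send_in_progress > 0);
	root->send_in_progress--;
}

/* Start a dedupe into this root: refuse with -EAGAIN while a send runs. */
static int toy_dedupe_begin(struct toy_root *root)
{
	if (root->send_in_progress)
		return -EAGAIN;
	root->dedupe_in_progress++;
	return 0;
}

static void toy_dedupe_end(struct toy_root *root)
{
	assert(root->dedupe_in_progress > 0);
	root->dedupe_in_progress--;
}
```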
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
[3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-22 16:43:42 +01:00
/*
* Number of currently running deduplication operations that have a
* destination inode belonging to this root . Protected by the lock
* root_item_lock .
*/
int dedupe_in_progress ;
2020-01-30 14:59:45 +02:00
/* For exclusion of snapshot creation and nocow writes */
struct btrfs_drew_lock snapshot_lock ;
Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space
Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting") forced
nocow writes to fallback to COW, during writeback, when a snapshot is
created. This resulted in writes made before creating the snapshot to
unexpectedly fail with ENOSPC during writeback when success (0) was
returned to user space through the write system call.
The steps leading to this problem are:
1. When it's not possible to allocate data space for a write, the
buffered write path checks if a NOCOW write is possible. If it is,
it will not reserve space and success (0) is returned to user space.
2. Then when a snapshot is created, the root's will_be_snapshotted
atomic is incremented and writeback is triggered for all inodes that
belong to the root being snapshotted. Incrementing that atomic forces
all previous writes to fallback to COW during writeback (running
delalloc).
3. This results in the writeback for the inodes failing and therefore
setting the ENOSPC error in their mappings, so that a subsequent
fsync on them will report the error to user space. So it's not a
completely silent data loss (since fsync will report ENOSPC) but it's
a very unexpected and undesirable behaviour, because if a clean
shutdown/unmount of the filesystem happens without previous calls to
fsync, it is expected to have the data present in the files after
mounting the filesystem again.
So fix this by adding a new atomic named snapshot_force_cow to the
root structure which prevents this behaviour and works the following way:
1. It is incremented when we start to create a snapshot after triggering
writeback and before waiting for writeback to finish.
2. This new atomic is now what is used by writeback (running delalloc)
to decide whether we need to fallback to COW or not. Because we
incremented this new atomic after triggering writeback in the
snapshot creation ioctl, we ensure that all buffered writes that
happened before snapshot creation will succeed and not fallback to
COW (which would make them fail with ENOSPC).
3. The existing atomic, will_be_snapshotted, is kept because it is used
to force new buffered writes, that start after we started
snapshotting, to reserve data space even when NOCOW is possible.
This makes these writes fail early with ENOSPC when there's no
available space to allocate, preventing the unexpected behaviour of
writeback later failing with ENOSPC due to a fallback to COW mode.
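The ordering of the two counters during snapshot creation could be modelled as below. This is a sketch under stated assumptions: plain `int` fields stand in for the atomics, and every `toy_*` name is hypothetical rather than taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduced root with the two counters named in the text. */
struct toy_root {
	int will_be_snapshotted;
	int snapshot_force_cow;
};

/* Snapshot creation begins; writeback is triggered after this point. */
static void toy_snapshot_begin(struct toy_root *root)
{
	root->will_be_snapshotted++;
}

/* Called after triggering writeback, before waiting for it to finish. */
static void toy_snapshot_writeback_triggered(struct toy_root *root)
{
	root->snapshot_force_cow++;
}

/* New buffered writes must reserve data space even if NOCOW is possible. */
static bool toy_write_must_reserve(const struct toy_root *root)
{
	return root->will_be_snapshotted > 0;
}

/* Writeback (running delalloc) falls back to COW only on this counter. */
static bool toy_writeback_must_cow(const struct toy_root *root)
{
	return root->snapshot_force_cow > 0;
}
```

Between the two increments, new writes already reserve space while writeback of earlier NOCOW writes does not fall back to COW, which is the window the commit message relies on.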
Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 10:30:30 +08:00
atomic_t snapshot_force_cow ;
2017-12-12 15:34:34 +08:00
/* For qgroup metadata reserved space */
spinlock_t qgroup_meta_rsv_lock ;
u64 qgroup_meta_rsv_pertrans ;
u64 qgroup_meta_rsv_prealloc ;
btrfs: qgroup: try to flush qgroup space when we get -EDQUOT
[PROBLEM]
There are known problems related to how btrfs handles qgroup reserved
space. One of the most obvious cases is the test case btrfs/153,
which does fallocate, then writes into the preallocated range.
btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
--- tests/btrfs/153.out 2019-10-22 15:18:14.068965341 +0800
+++ xfstests-dev/results//btrfs/153.out.bad 2020-07-01 20:24:40.730000089 +0800
@@ -1,2 +1,5 @@
QA output created by 153
+pwrite: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
Silence is golden
...
(Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad' to see the entire diff)
[CAUSE]
Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
we always reserve space no matter if it's COW or not.
Such behavior change is mostly for performance, and reverting it is not
a good idea anyway.
For a preallocated extent, we reserve qgroup data space for it already,
and since we also reserve data space for qgroup at buffered write time,
it needs twice the space for us to write into preallocated space.
This leads to the -EDQUOT in buffered write routine.
And we can't follow the same solution: unlike the data/meta space check,
qgroup reserved space is shared between data and metadata.
The EDQUOT can happen at the metadata reservation, so doing NODATACOW
check after qgroup reservation failure is not a solution.
[FIX]
To solve the problem, we don't return -EDQUOT directly, but every time
we get a -EDQUOT, we try to flush qgroup space:
- Flush all inodes of the root
NODATACOW writes will free the qgroup reserved at run_dealloc_range().
However we don't have the infrastructure to only flush NODATACOW
inodes, so here we flush all inodes anyway.
- Wait for ordered extents
This would convert the preallocated metadata space into per-trans
metadata, which can be freed in later transaction commit.
- Commit transaction
This will free all per-trans metadata space.
Also we don't want to trigger flush multiple times, so here we introduce
a per-root wait list and a new root status, to ensure only one thread
starts the flushing.
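The "flush, then retry" idea can be illustrated with a toy accounting model. Everything here is hypothetical: `struct toy_qgroup`, its `reclaimable` field (standing in for space that flushing inodes, waiting for ordered extents and committing the transaction would free) and the `toy_*` helpers are assumptions for illustration. The per-root wait list that keeps a single flusher is not modelled.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical accounting state, in bytes. */
struct toy_qgroup {
	unsigned long limit;		/* quota limit */
	unsigned long reserved;		/* currently reserved space */
	unsigned long reclaimable;	/* part of reserved a flush can free */
};

/* Plain reservation: fails with -EDQUOT when the limit would be exceeded. */
static int toy_reserve_noflush(struct toy_qgroup *q, unsigned long bytes)
{
	if (q->reserved + bytes > q->limit)
		return -EDQUOT;
	q->reserved += bytes;
	return 0;
}

/* Stand-in for the three flush steps listed in the text. */
static void toy_flush(struct toy_qgroup *q)
{
	assert(q->reclaimable <= q->reserved);
	q->reserved -= q->reclaimable;
	q->reclaimable = 0;
}

/* Reserve, and on -EDQUOT flush once and try again before giving up. */
static int toy_reserve(struct toy_qgroup *q, unsigned long bytes)
{
	int ret = toy_reserve_noflush(q, bytes);

	if (ret != -EDQUOT)
		return ret;
	toy_flush(q);
	return toy_reserve_noflush(q, bytes);
}
```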
Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-13 18:50:48 +08:00
wait_queue_head_t qgroup_flush_wait ;
2018-08-17 17:38:12 +02:00
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-03 10:28:12 -07:00
/* Number of active swapfiles */
atomic_t nr_swapfiles ;
btrfs: qgroup: Introduce per-root swapped blocks infrastructure
To allow delayed subtree swap rescan, btrfs needs to record per-root
information about which tree blocks get swapped. This patch introduces
the required infrastructure.
The designed workflow will be:
1) Record the subtree root block that gets swapped.
During subtree swap:
O = Old tree blocks
N = New tree blocks
        reloc tree                      subvolume tree X
           Root                               Root
          /    \                             /    \
        NA     OB                          OA      OB
      /  |     |  \                      /  |      |  \
    NC  ND     OE  OF                   OC  OD     OE  OF
In this case, NA and OA are going to be swapped, record (NA, OA) into
subvolume tree X.
2) After subtree swap.
        reloc tree                      subvolume tree X
           Root                               Root
          /    \                             /    \
        OA     OB                          NA      OB
      /  |     |  \                      /  |      |  \
    OC  OD     OE  OF                   NC  ND     OE  OF
3a) COW happens for OB
If we are going to COW tree block OB, we check OB's bytenr against
tree X's swapped_blocks structure.
If it doesn't match any record, nothing will happen.
3b) COW happens for NA
Check NA's bytenr against tree X's swapped_blocks, and get a hit.
Then we do subtree scan on both subtrees OA and NA.
Resulting in 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
Then no matter what we do to subvolume tree X, qgroup numbers will
still be correct.
Then NA's record gets removed from X's swapped_blocks.
4) Transaction commit
Any record in X's swapped_blocks gets removed, since there is no
modification to swapped subtrees, no need to trigger heavy qgroup
subtree rescan for them.
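The record, lookup and cleanup steps of this workflow can be sketched with a small fixed-size table. This is an illustration only: the names and the array are hypothetical, and the kernel's `struct btrfs_qgroup_swapped_blocks` is not shown in this section, so its layout is not assumed here.

```c
#include <assert.h>

#define TOY_MAX_SWAPPED 8	/* hypothetical capacity */

struct toy_swapped_pair {
	unsigned long long subvol_bytenr;	/* block now in the subvolume tree (NA) */
	unsigned long long reloc_bytenr;	/* block now in the reloc tree (OA) */
};

struct toy_swapped_blocks {
	int nr;
	struct toy_swapped_pair pairs[TOY_MAX_SWAPPED];
};

/* Step 1: record a swapped pair. Returns 0, or -1 when the table is full. */
static int toy_record_swap(struct toy_swapped_blocks *sb,
			   unsigned long long subvol_bytenr,
			   unsigned long long reloc_bytenr)
{
	if (sb->nr >= TOY_MAX_SWAPPED)
		return -1;
	sb->pairs[sb->nr].subvol_bytenr = subvol_bytenr;
	sb->pairs[sb->nr].reloc_bytenr = reloc_bytenr;
	sb->nr++;
	return 0;
}

/*
 * Step 3: on COW of @bytenr, look for a record. On a hit, copy the pair to
 * @out (the caller would scan both subtrees), drop the record and return 1.
 * Returns 0 on a miss.
 */
static int toy_cow_check(struct toy_swapped_blocks *sb,
			 unsigned long long bytenr,
			 struct toy_swapped_pair *out)
{
	int i;

	for (i = 0; i < sb->nr; i++) {
		if (sb->pairs[i].subvol_bytenr != bytenr)
			continue;
		*out = sb->pairs[i];
		sb->pairs[i] = sb->pairs[sb->nr - 1];
		sb->nr--;
		return 1;
	}
	return 0;
}

/* Step 4: a transaction commit drops every remaining record. */
static void toy_commit(struct toy_swapped_blocks *sb)
{
	sb->nr = 0;
}
```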
This will introduce 128 bytes of overhead for each btrfs_root even if qgroup
is not enabled. This is to reduce memory allocations and potential
failures.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-23 15:15:16 +08:00
/* Record pairs of swapped blocks for qgroup */
struct btrfs_qgroup_swapped_blocks swapped_blocks ;
btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents
When we have extents shared amongst different inodes in the same subvolume,
if we fsync them in parallel we can end up with checksum items in the log
tree that represent ranges which overlap.
For example, consider we have inodes A and B, both sharing an extent that
covers the logical range from X to X + 64KiB:
1) Task A starts an fsync on inode A;
2) Task B starts an fsync on inode B;
3) Task A calls btrfs_csum_file_blocks(), and the first search in the
log tree, through btrfs_lookup_csum(), returns -EFBIG because it
finds an existing checksum item that covers the range from X - 64KiB
to X;
4) Task A checks that the checksum item has not reached the maximum
possible size (MAX_CSUM_ITEMS) and then releases the search path
before it does another path search for insertion (through a direct
call to btrfs_search_slot());
5) As soon as task A releases the path and before it does the search
for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG
too, because there is an existing checksum item that has an end
offset that matches the start offset (X) of the checksum range we want
to log;
6) Task B releases the path;
7) Task A does the path search for insertion (through btrfs_search_slot())
and then verifies that the checksum item that ends at offset X still
exists and extends its size to insert the checksums for the range from
X to X + 64KiB;
8) Task A releases the path and returns from btrfs_csum_file_blocks(),
having inserted the checksums into an existing checksum item that got
its size extended. At this point we have one checksum item in the log
tree that covers the logical range from X - 64KiB to X + 64KiB;
9) Task B now does a search for insertion using btrfs_search_slot() too,
but it finds that the previous checksum item no longer ends at the
offset X, it now ends at an offset of X + 64KiB, so it leaves that item
untouched.
Then it releases the path and calls btrfs_insert_empty_item()
that inserts a checksum item with a key offset corresponding to X and
a size for inserting a single checksum (4 bytes in case of crc32c).
Subsequent iterations end up extending this new checksum item so that
it contains the checksums for the range from X to X + 64KiB.
So after task B returns from btrfs_csum_file_blocks() we end up with
two checksum items in the log tree that have overlapping ranges, one
for the range from X - 64KiB to X + 64KiB, and another for the range
from X to X + 64KiB.
Having checksum items that represent ranges which overlap, regardless of
being in the log tree or in the checksums tree, can lead to problems where
checksums for a file range end up not being found. This type of problem
has happened a few times in the past and the following commits fixed them
and explain in detail why having checksum items with overlapping ranges is
problematic:
27b9a8122ff71a "Btrfs: fix csum tree corruption, duplicate and outdated checksums"
b84b8390d6009c "Btrfs: fix file read corruption after extent cloning and fsync"
40e046acbd2f36 "Btrfs: fix missing data checksums after replaying a log tree"
Since this specific instance of the problem can only happen when logging
inodes, because it is the only case where concurrent attempts to insert
checksums for the same range can happen, fix the issue by using an extent
io tree as a range lock to serialize checksum insertion during inode
logging.
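The overlap condition at the heart of this bug can be stated as a small predicate on half-open ranges. The helper below is hypothetical and only illustrates why the two items from the example conflict; it does not show the extent io tree range lock used by the fix.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long long toy_u64;

/* True if the half-open ranges [s1, e1) and [s2, e2) share any byte. */
static bool toy_ranges_overlap(toy_u64 s1, toy_u64 e1, toy_u64 s2, toy_u64 e2)
{
	assert(s1 < e1 && s2 < e2);
	return s1 < e2 && s2 < e1;
}
```

With X chosen arbitrarily, the item covering X - 64KiB to X + 64KiB overlaps the one covering X to X + 64KiB, while the original item ending at X merely touches a range starting at X and does not overlap it.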
This issue could often be reproduced by the test case generic/457 from
fstests. When it happens it produces the following trace:
BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item
BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610
BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884
item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4
item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4
item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4
item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4
item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4
item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4
(...)
BTRFS error (device dm-0): block=30625792 write time tree block corruption detected
------------[ cut here ]------------
WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs]
Modules linked in: btrfs dm_thin_pool ...
CPU: 1 PID: 15884 Comm: fsx Tainted: G W 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs]
Code: c7 c7 ...
RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001
RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0
R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000
FS: 00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btree_submit_bio_hook+0x67/0xc0 [btrfs]
submit_one_bio+0x31/0x50 [btrfs]
btree_write_cache_pages+0x2db/0x4b0 [btrfs]
? __filemap_fdatawrite_range+0xb1/0x110
do_writepages+0x23/0x80
__filemap_fdatawrite_range+0xd2/0x110
btrfs_write_marked_extents+0x15e/0x180 [btrfs]
btrfs_sync_log+0x206/0x10a0 [btrfs]
? kmem_cache_free+0x315/0x3b0
? btrfs_log_inode+0x1e8/0xf90 [btrfs]
? __mutex_unlock_slowpath+0x45/0x2a0
? lockref_put_or_lock+0x9/0x30
? dput+0x2d/0x580
? dput+0xb5/0x580
? btrfs_sync_file+0x464/0x4d0 [btrfs]
btrfs_sync_file+0x464/0x4d0 [btrfs]
do_fsync+0x38/0x60
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb41953a6d0
Code: 48 3d ...
RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0
RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003
RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009
R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060
R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace d543fc76f5ad7fd8 ]---
In that trace the tree checker detected the overlapping checksum items at
the time when we triggered writeback for the log tree when syncing the
log.
Another trace that can happen is due to BUG_ON() when deleting checksum
items while logging an inode:
BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584)
BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610
BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473
item 0 key (257 1 0) itemoff 16123 itemsize 160
inode generation 7 size 262144 mode 100600
item 1 key (257 12 256) itemoff 16103 itemsize 20
item 2 key (257 108 0) itemoff 16050 itemsize 53
extent data disk bytenr 13631488 nr 4096
extent data offset 0 nr 131072 ram 131072
(...)
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:3153!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs]
Code: 0f b6 ...
RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001
RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001
R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08
R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace
FS: 00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_del_csums+0x2f4/0x540 [btrfs]
copy_items+0x4b5/0x560 [btrfs]
btrfs_log_inode+0x910/0xf90 [btrfs]
btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
? dget_parent+0x5/0x370
btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
btrfs_sync_file+0x42b/0x4d0 [btrfs]
__x64_sys_msync+0x199/0x200
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fe586c65760
Code: 00 f7 ...
RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760
RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000
RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61
R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1
R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420
Modules linked in: dm_log_writes ...
---[ end trace c92a7f447a8515f5 ]---
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-18 12:14:50 +01:00
/* Used only by log trees, when logging csum items */
	struct extent_io_tree log_csum_range;
2018-08-17 17:38:12 +02:00
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
	u64 alloc_bytenr;
#endif
2020-01-24 09:33:00 -05:00
#ifdef CONFIG_BTRFS_DEBUG
	struct list_head leak_list;
#endif
2007-03-15 12:56:47 -04:00
};
2017-05-22 13:16:11 +03:00
2020-09-08 11:27:22 +01:00
/*
 * Structure that conveys information about an extent that is going to replace
 * all the extents in a file range.
 */
struct btrfs_replace_extent_info {
Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents
When cloning extents (or deduplicating) we create a transaction with a
space reservation that considers we will drop or update a single file
extent item of the destination inode (that we modify a single leaf). That
is fine for the vast majority of scenarios, however it might happen that
we need to drop many file extent items, and adjust at most two file extent
items, in the destination root, which can span multiple leaves. This will
cause either the call to btrfs_drop_extents() to fail with ENOSPC or
the subsequent calls to btrfs_insert_empty_item() or btrfs_update_inode()
(called through clone_finish_inode_update()) to fail with ENOSPC. Such
failure results in a transaction abort, leaving the filesystem in a
read-only mode.
In order to fix this we need to follow the same approach as the hole
punching code, where we create a local reservation with 1 unit and keep
ending and starting transactions, after balancing the btree inode,
when __btrfs_drop_extents() returns ENOSPC. So fix this by making the
extent cloning code call the recently added btrfs_punch_hole_range()
helper, which is what does the mentioned work for hole punching, and
make sure whenever we drop extent items in a transaction, we also add a
replacing file extent item, to avoid corruption (a hole) if after ending
a transaction and before starting a new one, the old transaction gets
committed and a power failure happens before we finish cloning.
A test case for fstests follows soon.
Reported-by: David Goodwin <david@codepoets.co.uk>
Link: https://lore.kernel.org/linux-btrfs/a4a4cf31-9cf4-e52c-1f86-c62d336c9cd1@codepoets.co.uk/
Reported-by: Sam Tygier <sam@tygier.co.uk>
Link: https://lore.kernel.org/linux-btrfs/82aace9f-a1e3-1f0b-055f-3ea75f7a41a0@tygier.co.uk/
Fixes: b6f3409b2197e8f ("Btrfs: reserve sufficient space for ioctl clone")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-05 11:09:50 +01:00
	u64 disk_offset;
	u64 disk_len;
	u64 data_offset;
	u64 data_len;
	u64 file_offset;
2020-09-08 11:27:21 +01:00
/* Pointer to a file extent item of type regular or prealloc. */
2019-07-05 11:09:50 +01:00
	char *extent_buf;
btrfs: fix metadata reservation for fallocate that leads to transaction aborts
When doing an fallocate(), especially a zero range operation, we assume
that reserving 3 units of metadata space is enough, that at most we touch
one leaf in subvolume/fs tree for removing existing file extent items and
inserting a new file extent item. This assumption is generally true for
most common use cases. However when we end up needing to remove file extent
items from multiple leaves, we can end up failing with -ENOSPC and abort
the current transaction, turning the filesystem to RO mode. When this
happens a stack trace like the following is dumped in dmesg/syslog:
[ 1500.620934] ------------[ cut here ]------------
[ 1500.620938] BTRFS: Transaction aborted (error -28)
[ 1500.620973] WARNING: CPU: 2 PID: 30807 at fs/btrfs/inode.c:9724 __btrfs_prealloc_file_range+0x512/0x570 [btrfs]
[ 1500.620974] Modules linked in: btrfs intel_rapl_msr intel_rapl_common kvm_intel (...)
[ 1500.621010] CPU: 2 PID: 30807 Comm: xfs_io Tainted: G W 5.9.0-rc3-btrfs-next-67 #1
[ 1500.621012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 1500.621023] RIP: 0010:__btrfs_prealloc_file_range+0x512/0x570 [btrfs]
[ 1500.621026] Code: 8b 40 50 f0 48 (...)
[ 1500.621028] RSP: 0018:ffffb05fc8803ca0 EFLAGS: 00010286
[ 1500.621030] RAX: 0000000000000000 RBX: ffff9608af276488 RCX: 0000000000000000
[ 1500.621032] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
[ 1500.621033] RBP: ffffb05fc8803d90 R08: 0000000000000001 R09: 0000000000000001
[ 1500.621035] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000003200000
[ 1500.621037] R13: 00000000ffffffe4 R14: ffff9608af275fe8 R15: ffff9608af275f60
[ 1500.621039] FS: 00007fb5b2368ec0(0000) GS:ffff9608b6600000(0000) knlGS:0000000000000000
[ 1500.621041] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1500.621043] CR2: 00007fb5b2366fb8 CR3: 0000000202d38005 CR4: 00000000003706e0
[ 1500.621046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1500.621047] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1500.621049] Call Trace:
[ 1500.621076] btrfs_prealloc_file_range+0x10/0x20 [btrfs]
[ 1500.621087] btrfs_fallocate+0xccd/0x1280 [btrfs]
[ 1500.621108] vfs_fallocate+0x14d/0x290
[ 1500.621112] ksys_fallocate+0x3a/0x70
[ 1500.621117] __x64_sys_fallocate+0x1a/0x20
[ 1500.621120] do_syscall_64+0x33/0x80
[ 1500.621123] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1500.621126] RIP: 0033:0x7fb5b248c477
[ 1500.621128] Code: 89 7c 24 08 (...)
[ 1500.621130] RSP: 002b:00007ffc7bee9060 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
[ 1500.621132] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb5b248c477
[ 1500.621134] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000003
[ 1500.621136] RBP: 0000557718faafd0 R08: 0000000000000000 R09: 0000000000000000
[ 1500.621137] R10: 0000000003200000 R11: 0000000000000293 R12: 0000000000000010
[ 1500.621139] R13: 0000557718faafb0 R14: 0000557718faa480 R15: 0000000000000003
[ 1500.621151] irq event stamp: 1026217
[ 1500.621154] hardirqs last enabled at (1026223): [<ffffffffba965570>] console_unlock+0x500/0x5c0
[ 1500.621156] hardirqs last disabled at (1026228): [<ffffffffba9654c7>] console_unlock+0x457/0x5c0
[ 1500.621159] softirqs last enabled at (1022486): [<ffffffffbb6003dc>] __do_softirq+0x3dc/0x606
[ 1500.621161] softirqs last disabled at (1022477): [<ffffffffbb4010b2>] asm_call_on_stack+0x12/0x20
[ 1500.621162] ---[ end trace 2955b08408d8b9d4 ]---
[ 1500.621167] BTRFS: error (device sdj) in __btrfs_prealloc_file_range:9724: errno=-28 No space left
When we use fallocate() internally, for reserving an extent for a space
cache, inode cache or relocation, we can't hit this problem since either
there aren't any file extent items to remove from the subvolume tree or
there is at most one.
When using plain fallocate() it's very unlikely, since that would require
having many file extent items representing holes for the target range and
crossing multiple leaves - we attempt to increase the range (merge) of such
file extent items when punching holes, so at most we end up with 2 file
extent items for holes at leaf boundaries.
However when using the zero range operation of fallocate() for a large
range (100+ MiB for example) that's fairly easy to trigger. The following
example reproducer triggers the issue:
$ cat reproducer.sh
#!/bin/bash
umount /dev/sdj &> /dev/null
mkfs.btrfs -f -n 16384 -O ^no-holes /dev/sdj > /dev/null
mount /dev/sdj /mnt/sdj
# Create a 100M file with many file extent items. Punch a hole every 8K
# just to speedup the file creation - we could do 4K sequential writes
# followed by fsync (or O_SYNC) as well, but that takes a lot of time.
file_size=$((100 * 1024 * 1024))
xfs_io -f -c "pwrite -S 0xab -b 10M 0 $file_size" /mnt/sdj/foobar
for ((i = 0; i < $file_size; i += 8192)); do
xfs_io -c "fpunch $i 4096" /mnt/sdj/foobar
done
# Force a transaction commit, so the zero range operation will be forced
# to COW all metadata extents it needs to touch.
sync
xfs_io -c "fzero 0 $file_size" /mnt/sdj/foobar
umount /mnt/sdj
$ ./reproducer.sh
wrote 104857600/104857600 bytes at offset 0
100 MiB, 10 ops; 0.0669 sec (1.458 GiB/sec and 149.3117 ops/sec)
fallocate: No space left on device
$ dmesg
<shows the same stack trace pasted before>
To fix this use the existing infrastructure that hole punching and
extent cloning use for replacing a file range with another extent. This
deals with doing the removal of file extent items and inserting the new
one using an incremental approach, reserving more space when needed and
always ensuring we don't leave an implicit hole in the range in case
we need to do multiple iterations and a crash happens between iterations.
A test case for fstests will follow up soon.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-08 11:27:20 +01:00
	/*
	 * Set to true when attempting to replace a file range with a new extent
	 * described by this structure, set to false when attempting to clone an
	 * existing extent into a file range.
	 */
	bool is_new_extent;
	/* Meaningful only if is_new_extent is true. */
	int qgroup_reserved;
	/*
	 * Meaningful only if is_new_extent is true.
	 * Used to track how many extent items we have already inserted in a
	 * subvolume tree that refer to the extent described by this structure,
	 * so that we know when to create a new delayed ref or update an existing
	 * one.
	 */
	int insertions;
2019-07-05 11:09:50 +01:00
};
2020-11-04 11:07:32 +00:00
/* Arguments for btrfs_drop_extents() */
struct btrfs_drop_extents_args {
	/* Input parameters */
	/*
	 * If NULL, btrfs_drop_extents() will allocate and free its own path.
	 * If 'replace_extent' is true, this must not be NULL. Also the path
	 * is always released except if 'replace_extent' is true and
	 * btrfs_drop_extents() sets 'extent_inserted' to true, in which case
	 * the path is kept locked.
	 */
	struct btrfs_path *path;
	/* Start offset of the range to drop extents from */
	u64 start;
	/* End (exclusive, last byte + 1) of the range to drop extents from */
	u64 end;
	/* If true drop all the extent maps in the range */
	bool drop_cache;
	/*
	 * If true it means we want to insert a new extent after dropping all
	 * the extents in the range. If this is true, the 'extent_item_size'
	 * parameter must be set as well and the 'extent_inserted' field will
	 * be set to true by btrfs_drop_extents() if it could insert the new
	 * extent.
	 * Note: when this is set to true the path must not be NULL.
	 */
	bool replace_extent;
	/*
	 * Used if 'replace_extent' is true. Size of the file extent item to
	 * insert after dropping all existing extents in the range
	 */
	u32 extent_item_size;
	/* Output parameters */
	/*
	 * Set to the minimum between the input parameter 'end' and the end
	 * (exclusive, last byte + 1) of the last dropped extent. This is always
	 * set even if btrfs_drop_extents() returns an error.
	 */
	u64 drop_end;
btrfs: update the number of bytes used by an inode atomically
There are several occasions where we do not update the inode's number of
used bytes atomically, resulting in a concurrent stat(2) syscall to report
a value of used blocks that does not correspond to a valid value, that is,
a value that matches neither what we had before the operation nor
what we get after the operation completes.
In extreme cases it can result in stat(2) reporting zero used blocks, which
can cause problems for some userspace tools where they can consider a file
with a non-zero size and zero used blocks as completely sparse and skip
reading data, as reported/discussed a long time ago in some threads like
the following:
https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
The cases where this can happen are the following:
-> Case 1
If we do a write (buffered or direct IO) against a file region for which
there is already an allocated extent (or multiple extents), then we have a
short time window where we can report a number of used blocks to stat(2)
that does not take into account the file region being overwritten. This
short time window happens when completing the ordered extent(s).
This happens because when we drop the extents in the write range we
decrement the inode's number of bytes and later on when we insert the new
extent(s) we increment the number of bytes in the inode, resulting in a
short time window where a stat(2) syscall can get an incorrect number of
used blocks.
If we do writes that overwrite an entire file, then we have a short time
window where we report 0 used blocks to stat(2).
Example reproducer:
$ cat reproducer-1.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got
while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
expected=$(stat -c %b $MNT/foobar)
# Create a process to keep calling stat(2) on the file and see if the
# reported number of blocks used (disk space used) changes, it should
# not because we are not increasing the file size nor punching holes.
stat_loop $MNT/foobar $expected &
loop_pid=$!
for ((i = 0; i < 50000; i++)); do
xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
done
kill $loop_pid &> /dev/null
wait
umount $DEV
$ ./reproducer-1.sh
ERROR: unexpected used blocks (got: 0 expected: 128)
ERROR: unexpected used blocks (got: 0 expected: 128)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 2
If we do a buffered write against a file region that does not have any
allocated extents, like a hole or beyond EOF, then during ordered extent
completion we have a short time window where a concurrent stat(2) syscall
can report a number of used blocks that does not correspond to the value
before or after the write operation, a value that is actually larger than
the value after the write completes.
This happens because once we start a buffered write into an unallocated
file range we increment the inode's 'new_delalloc_bytes', to make sure
any stat(2) call gets a correct used blocks value before delalloc is
flushed and completes. However at ordered extent completion, after we
inserted the new extent, we increment the inode's number of bytes used
with the size of the new extent, and only later, when clearing the range
in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
counter with the size of the extent. So this results in a short time
window where a concurrent stat(2) syscall can report a number of used
blocks that accounts for the new extent twice.
Example reproducer:
$ cat reproducer-2.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got
while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
touch $MNT/foobar
write_size=$((64 * 1024))
for ((i = 0; i < 16384; i++)); do
offset=$(($i * $write_size))
xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
blocks_used=$(stat -c %b $MNT/foobar)
# Fsync the file to trigger writeback and keep calling stat(2) on it
# to see if the number of blocks used changes.
stat_loop $MNT/foobar $blocks_used &
loop_pid=$!
xfs_io -c "fsync" $MNT/foobar
kill $loop_pid &> /dev/null
wait $loop_pid
done
umount $DEV
$ ./reproducer-2.sh
ERROR: unexpected used blocks (got: 265472 expected: 265344)
ERROR: unexpected used blocks (got: 284032 expected: 283904)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 3
Another case where such problems happen is during other operations that
replace extents in a file range with other extents. Those operations are
extent cloning, deduplication and fallocate's zero range operation.
The cause of the problem is similar to the first case. When we drop the
extents from a range, we decrement the inode's number of bytes, and later
on, after inserting the new extents we increment it. Since this is not
done atomically, a concurrent stat(2) call can see and return a number of
used blocks that is smaller than it should be and matches neither the number
of used blocks before nor after the clone/deduplication/zero operation.
Like for the first case, when doing a clone, deduplication or zero range
operation against an entire file, we end up having a time window where we
can report 0 used blocks to a stat(2) call.
Example reproducer:
$ cat reproducer-3.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f -m reflink=1 $DEV > /dev/null
mount $DEV $MNT
extent_size=$((64 * 1024))
num_extents=16384
file_size=$(($extent_size * $num_extents))
# File foo has many small extents.
xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
> /dev/null
# File bar has far fewer extents and has exactly the same data as foo.
xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
expected=$(stat -c %b $MNT/foo)
# Now deduplicate bar into foo. While the deduplication is in progress,
# the number of used blocks/file size reported by stat should not change
xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null &
dedupe_pid=$!
while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
used=$(stat -c %b $MNT/foo)
if [ $used -ne $expected ]; then
echo "Unexpected blocks used: $used (expected: $expected)"
fi
done
umount $DEV
$ ./reproducer-3.sh
Unexpected blocks used: 2076800 (expected: 2097152)
Unexpected blocks used: 2097024 (expected: 2097152)
Unexpected blocks used: 2079872 (expected: 2097152)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
So fix this by:
1) Making btrfs_drop_extents() not decrement the VFS inode's number of
bytes, and instead return the number of bytes;
2) Making any code that drops extents and adds new extents update the
inode's number of bytes atomically, while holding the btrfs inode's
spinlock, which is also used by the stat(2) callback to get the inode's
number of bytes;
3) For ranges in the inode's iotree that are marked as 'delalloc new',
corresponding to previously unallocated ranges, increment the inode's
number of bytes when clearing the 'delalloc new' bit from the range,
in the same critical section that decrements the inode's
'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
An alternative would be to have btrfs_getattr() wait for any IO (ordered
extents in progress) and lock the whole range (0 to (u64)-1) while it
computes the number of blocks used. But that would mean blocking
stat(2), which is a heavily used syscall and expected to be fast, waiting
for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-04 11:07:34 +00:00
	/*
	 * The number of allocated bytes found in the range. This can be smaller
	 * than the range's length when there are holes in the range.
	 */
	u64 bytes_found;
2020-11-04 11:07:32 +00:00
	/*
	 * Only set if 'replace_extent' is true. Set to true if we were able
	 * to insert a replacement extent after dropping all extents in the
	 * range, otherwise set to false by btrfs_drop_extents().
	 * Also, if btrfs_drop_extents() has set this to true it means it
	 * returned with the path locked, otherwise if it has set this to
	 * false it has returned with the path released.
	 */
	bool extent_inserted;
};
2017-07-24 15:14:25 -04:00
struct btrfs_file_private {
	void *filldir_buf;
};
2007-03-15 12:56:47 -04:00
2016-06-15 09:22:56 -04:00
static inline u32 BTRFS_LEAF_DATA_SIZE(const struct btrfs_fs_info *info)
2016-06-15 10:33:06 -04:00
{
2017-05-22 13:16:11 +03:00
	return info->nodesize - sizeof(struct btrfs_header);
2016-06-15 10:33:06 -04:00
}
2017-05-29 09:43:43 +03:00
#define BTRFS_LEAF_DATA_OFFSET offsetof(struct btrfs_leaf, items)
2016-06-15 09:22:56 -04:00
static inline u32 BTRFS_MAX_ITEM_SIZE(const struct btrfs_fs_info *info)
2016-06-15 10:33:06 -04:00
{
2016-06-15 09:22:56 -04:00
	return BTRFS_LEAF_DATA_SIZE(info) - sizeof(struct btrfs_item);
2016-06-15 10:33:06 -04:00
}
2016-06-15 09:22:56 -04:00
static inline u32 BTRFS_NODEPTRS_PER_BLOCK(const struct btrfs_fs_info *info)
2016-06-15 10:33:06 -04:00
{
2016-06-15 09:22:56 -04:00
	return BTRFS_LEAF_DATA_SIZE(info) / sizeof(struct btrfs_key_ptr);
2016-06-15 10:33:06 -04:00
}
#define BTRFS_FILE_EXTENT_INLINE_DATA_START \
	(offsetof(struct btrfs_file_extent_item, disk_bytenr))
2016-06-15 09:22:56 -04:00
static inline u32 BTRFS_MAX_INLINE_DATA_SIZE(const struct btrfs_fs_info *info)
2016-06-15 10:33:06 -04:00
{
2016-06-15 09:22:56 -04:00
	return BTRFS_MAX_ITEM_SIZE(info) -
2016-06-15 10:33:06 -04:00
	       BTRFS_FILE_EXTENT_INLINE_DATA_START;
}
2016-06-15 09:22:56 -04:00
static inline u32 BTRFS_MAX_XATTR_SIZE ( const struct btrfs_fs_info * info )
2016-06-15 10:33:06 -04:00
{
2016-06-15 09:22:56 -04:00
return BTRFS_MAX_ITEM_SIZE ( info ) - sizeof ( struct btrfs_dir_item ) ;
2016-06-15 10:33:06 -04:00
}
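As a rough worked example of the size helpers above, the sketch below redoes the arithmetic in userspace. Everything prefixed `demo_`/`DEMO_` is hypothetical. In particular the three size constants are assumptions standing in for `sizeof(struct btrfs_header)`, `sizeof(struct btrfs_item)` and `sizeof(struct btrfs_key_ptr)`, whose definitions are not part of this section, so the values should be checked against the real structures.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed structure sizes (illustrative; verify against the real headers). */
#define DEMO_HEADER_SIZE  101u /* stands in for sizeof(struct btrfs_header) */
#define DEMO_ITEM_SIZE     25u /* stands in for sizeof(struct btrfs_item) */
#define DEMO_KEY_PTR_SIZE  33u /* stands in for sizeof(struct btrfs_key_ptr) */

/* Hypothetical stand-in for the single btrfs_fs_info field used here. */
struct demo_size_info {
	uint32_t nodesize;
};

/* Bytes left in a tree block once the header is accounted for. */
static inline uint32_t demo_leaf_data_size(const struct demo_size_info *info)
{
	assert(info->nodesize > DEMO_HEADER_SIZE);
	return info->nodesize - DEMO_HEADER_SIZE;
}

/* Largest payload of a single item: leaf data minus one item header. */
static inline uint32_t demo_max_item_size(const struct demo_size_info *info)
{
	return demo_leaf_data_size(info) - DEMO_ITEM_SIZE;
}

/* Number of key pointers that fit into an internal node. */
static inline uint32_t demo_nodeptrs_per_block(const struct demo_size_info *info)
{
	return demo_leaf_data_size(info) / DEMO_KEY_PTR_SIZE;
}
```

Under these assumed sizes, a `nodesize` of 16384 gives 16283 bytes of leaf data, a maximum item size of 16258 bytes and 493 key pointers per node.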
2011-06-28 15:10:37 +00:00
/*
 * Flags for mount options.
 *
 * Note: don't forget to add new options to btrfs_show_options()
 */
2008-01-09 09:23:21 -05:00
# define BTRFS_MOUNT_NODATASUM (1 << 0)
# define BTRFS_MOUNT_NODATACOW (1 << 1)
# define BTRFS_MOUNT_NOBARRIER (1 << 2)
2008-01-18 10:54:22 -05:00
# define BTRFS_MOUNT_SSD (1 << 3)
2008-05-13 13:46:40 -04:00
# define BTRFS_MOUNT_DEGRADED (1 << 4)
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 14:49:59 -04:00
# define BTRFS_MOUNT_COMPRESS (1 << 5)
2009-04-02 16:49:40 -04:00
# define BTRFS_MOUNT_NOTREELOG (1 << 6)
2009-04-02 16:59:01 -04:00
# define BTRFS_MOUNT_FLUSHONCOMMIT (1 << 7)
2009-06-09 20:28:34 -04:00
# define BTRFS_MOUNT_SSD_SPREAD (1 << 8)
2009-06-10 09:51:32 -04:00
# define BTRFS_MOUNT_NOSSD (1 << 9)
2019-12-13 16:22:11 -08:00
# define BTRFS_MOUNT_DISCARD_SYNC (1 << 10)
2010-01-28 16:18:15 -05:00
# define BTRFS_MOUNT_FORCE_COMPRESS (1 << 11)
2010-06-21 14:48:16 -04:00
# define BTRFS_MOUNT_SPACE_CACHE (1 << 12)
2010-09-21 14:21:34 -04:00
# define BTRFS_MOUNT_CLEAR_CACHE (1 << 13)
2010-10-29 15:46:43 -04:00
# define BTRFS_MOUNT_USER_SUBVOL_RM_ALLOWED (1 << 14)
2011-02-16 13:10:41 -05:00
# define BTRFS_MOUNT_ENOSPC_DEBUG (1 << 15)
2011-05-24 15:35:30 -04:00
# define BTRFS_MOUNT_AUTO_DEFRAG (1 << 16)
2020-11-26 15:10:39 +02:00
/* bit 17 is free */
2016-01-19 10:23:02 +08:00
# define BTRFS_MOUNT_USEBACKUPROOT (1 << 18)
2012-01-16 22:04:48 +02:00
# define BTRFS_MOUNT_SKIP_BALANCE (1 << 19)
2012-01-16 15:27:58 -05:00
# define BTRFS_MOUNT_CHECK_INTEGRITY (1 << 20)
# define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
2011-10-03 23:22:31 -04:00
# define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR (1 << 22)
2013-08-15 17:11:24 +02:00
# define BTRFS_MOUNT_RESCAN_UUID_TREE (1 << 23)
2015-09-23 14:54:14 -04:00
# define BTRFS_MOUNT_FRAGMENT_DATA (1 << 24)
# define BTRFS_MOUNT_FRAGMENT_METADATA (1 << 25)
2015-12-18 11:11:10 -08:00
# define BTRFS_MOUNT_FREE_SPACE_TREE (1 << 26)
2016-01-19 10:23:03 +08:00
# define BTRFS_MOUNT_NOLOGREPLAY (1 << 27)
2017-09-29 15:43:48 -04:00
# define BTRFS_MOUNT_REF_VERIFY (1 << 28)
2019-12-13 16:22:14 -08:00
# define BTRFS_MOUNT_DISCARD_ASYNC (1 << 29)
2020-10-16 11:29:18 -04:00
# define BTRFS_MOUNT_IGNOREBADROOTS (1 << 30)
2020-10-16 11:29:19 -04:00
# define BTRFS_MOUNT_IGNOREDATACSUMS (1 << 31)
2007-12-14 15:30:32 -05:00
2013-08-01 18:14:52 +02:00
# define BTRFS_DEFAULT_COMMIT_INTERVAL (30)
2015-10-08 14:14:16 +02:00
# define BTRFS_DEFAULT_MAX_INLINE (2048)
2013-08-01 18:14:52 +02:00
2007-12-14 15:30:32 -05:00
# define btrfs_clear_opt(o, opt) ((o) &= ~BTRFS_MOUNT_##opt)
# define btrfs_set_opt(o, opt) ((o) |= BTRFS_MOUNT_##opt)
2013-02-20 23:32:52 -07:00
# define btrfs_raw_test_opt(o, opt) ((o) & BTRFS_MOUNT_##opt)
2016-06-09 21:38:35 -04:00
#define btrfs_test_opt(fs_info, opt) ((fs_info)->mount_opt & \
2007-12-14 15:30:32 -05:00
				      BTRFS_MOUNT_##opt)
2014-02-05 15:26:17 +01:00
2016-06-09 21:38:35 -04:00
#define btrfs_set_and_info(fs_info, opt, fmt, args...) \
2020-07-06 11:59:36 -03:00
do { \
2016-06-09 21:38:35 -04:00
	if (!btrfs_test_opt(fs_info, opt)) \
		btrfs_info(fs_info, fmt, ##args); \
	btrfs_set_opt(fs_info->mount_opt, opt); \
2020-07-06 11:59:36 -03:00
} while (0)
2014-04-23 19:33:33 +08:00
2016-06-09 21:38:35 -04:00
#define btrfs_clear_and_info(fs_info, opt, fmt, args...) \
2020-07-06 11:59:36 -03:00
do { \
2016-06-09 21:38:35 -04:00
	if (btrfs_test_opt(fs_info, opt)) \
		btrfs_info(fs_info, fmt, ##args); \
	btrfs_clear_opt(fs_info->mount_opt, opt); \
2020-07-06 11:59:36 -03:00
} while (0)
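The option macros above build the flag name by pasting the short option name onto the `BTRFS_MOUNT_` prefix. A minimal userspace sketch of the same pattern follows. The `DEMO_MOUNT_*` bits and `demo_*` names are hypothetical, and the sketch uses unsigned constants (`1U`), which is a choice made here rather than something taken from the header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical option bits, named after the pattern <PREFIX>_<opt>. */
#define DEMO_MOUNT_SSD      (1U << 3)
#define DEMO_MOUNT_COMPRESS (1U << 5)

/* The short option name is pasted onto the prefix with ##. */
#define demo_set_opt(o, opt)   ((o) |= DEMO_MOUNT_##opt)
#define demo_clear_opt(o, opt) ((o) &= ~DEMO_MOUNT_##opt)
#define demo_test_opt(o, opt)  ((o) & DEMO_MOUNT_##opt)

/* Set COMPRESS and clear SSD, returning the updated option word. */
static inline uint32_t demo_apply(uint32_t mount_opt)
{
	demo_set_opt(mount_opt, COMPRESS);
	demo_clear_opt(mount_opt, SSD);
	assert(demo_test_opt(mount_opt, COMPRESS));
	return mount_opt;
}
```

Because the test macro returns the masked bits rather than 0 or 1, callers in this sketch treat the result as a truth value only.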
2014-04-23 19:33:33 +08:00
2014-02-05 15:26:17 +01:00
/*
 * Requests for changes that need to be done during transaction commit.
 *
 * Internal mount options that are used for special handling of the real
 * mount options (e.g. cannot be set during remount and have to be set during
 * transaction commit)
 */
2020-11-26 15:10:39 +02:00
# define BTRFS_PENDING_COMMIT (0)
2014-02-05 15:26:17 +01:00
2014-02-05 15:26:17 +01:00
#define btrfs_test_pending(info, opt) \
	test_bit(BTRFS_PENDING_##opt, &(info)->pending_changes)
#define btrfs_set_pending(info, opt) \
	set_bit(BTRFS_PENDING_##opt, &(info)->pending_changes)
#define btrfs_clear_pending(info, opt) \
	clear_bit(BTRFS_PENDING_##opt, &(info)->pending_changes)
/*
 * Helpers for setting pending mount option changes.
 *
 * Expects corresponding macros
 * BTRFS_PENDING_SET_ and CLEAR_ + short mount option name
 */
#define btrfs_set_pending_and_info(info, opt, fmt, args...) \
do { \
	if (!btrfs_raw_test_opt((info)->mount_opt, opt)) { \
		btrfs_info((info), fmt, ##args); \
		btrfs_set_pending((info), SET_##opt); \
		btrfs_clear_pending((info), CLEAR_##opt); \
	} \
} while (0)
#define btrfs_clear_pending_and_info(info, opt, fmt, args...) \
do { \
	if (btrfs_raw_test_opt((info)->mount_opt, opt)) { \
		btrfs_info((info), fmt, ##args); \
		btrfs_set_pending((info), CLEAR_##opt); \
		btrfs_clear_pending((info), SET_##opt); \
	} \
} while (0)
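The pending helpers above pair each option with a SET bit and a CLEAR bit, and requesting one cancels the other. The sketch below mimics that in userspace. The option name `FOO`, its bit numbers and the `demo_*` bit helpers are all hypothetical, and the helpers are plain non-atomic stand-ins for the kernel's `set_bit()`, `clear_bit()` and `test_bit()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pending-change bit numbers for an option called FOO. */
#define DEMO_PENDING_SET_FOO   1
#define DEMO_PENDING_CLEAR_FOO 2

/* Non-atomic stand-ins for the kernel bit operations. */
static inline void demo_set_bit(int nr, unsigned long *addr)
{
	*addr |= 1UL << nr;
}

static inline void demo_clear_bit(int nr, unsigned long *addr)
{
	*addr &= ~(1UL << nr);
}

static inline bool demo_test_bit(int nr, const unsigned long *addr)
{
	return (*addr >> nr) & 1UL;
}

#define demo_set_pending(p, opt)   demo_set_bit(DEMO_PENDING_##opt, (p))
#define demo_clear_pending(p, opt) demo_clear_bit(DEMO_PENDING_##opt, (p))
#define demo_test_pending(p, opt)  demo_test_bit(DEMO_PENDING_##opt, (p))

/* Request that FOO be enabled at commit: set SET_FOO, cancel CLEAR_FOO. */
static inline void demo_request_set_foo(unsigned long *pending)
{
	demo_set_pending(pending, SET_FOO);
	demo_clear_pending(pending, CLEAR_FOO);
	assert(!demo_test_pending(pending, CLEAR_FOO));
}
```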
2008-01-08 15:54:37 -05:00
/*
* Inode flags
*/
2008-01-14 13:26:08 -05:00
# define BTRFS_INODE_NODATASUM (1 << 0)
# define BTRFS_INODE_NODATACOW (1 << 1)
# define BTRFS_INODE_READONLY (1 << 2)
2008-10-29 14:49:59 -04:00
# define BTRFS_INODE_NOCOMPRESS (1 << 3)
2008-10-30 14:25:28 -04:00
# define BTRFS_INODE_PREALLOC (1 << 4)
2009-04-17 10:37:41 +02:00
# define BTRFS_INODE_SYNC (1 << 5)
# define BTRFS_INODE_IMMUTABLE (1 << 6)
# define BTRFS_INODE_APPEND (1 << 7)
# define BTRFS_INODE_NODUMP (1 << 8)
# define BTRFS_INODE_NOATIME (1 << 9)
# define BTRFS_INODE_DIRSYNC (1 << 10)
Btrfs: Per file/directory controls for COW and compression
Data compression and data cow are controlled across the entire FS by mount
options right now. ioctls are needed to set this on a per file or per
directory basis. This has been proposed previously, but VFS developers
wanted us to use generic ioctls rather than btrfs-specific ones.
According to Chris's comment, there should be just one true compression
method(probably LZO) stored in the super. However, before this, we would
wait for that one method is stable enough to be adopted into the super.
So I list it as a long term goal, and just store it in ram today.
After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to
control file and directory's datacow and compression attribute.
NOTE:
- The compression type is selected by such rules:
If we mount btrfs with compress options, ie, zlib/lzo, the type is it.
Otherwise, we'll use the default compress type (zlib today).
v1->v2:
- rebase to the latest btrfs.
v2->v3:
- fix a problem, i.e. when a file is set NOCOW via mount option, then this NOCOW
will be screwed by inheritance from parent directory.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-22 10:12:20 +00:00
# define BTRFS_INODE_COMPRESS (1 << 11)
2009-04-17 10:37:41 +02:00
2011-03-28 02:01:25 +00:00
# define BTRFS_INODE_ROOT_ITEM_INIT (1 << 31)
2019-03-13 14:31:35 +08:00
# define BTRFS_INODE_FLAG_MASK \
( BTRFS_INODE_NODATASUM | \
BTRFS_INODE_NODATACOW | \
BTRFS_INODE_READONLY | \
BTRFS_INODE_NOCOMPRESS | \
BTRFS_INODE_PREALLOC | \
BTRFS_INODE_SYNC | \
BTRFS_INODE_IMMUTABLE | \
BTRFS_INODE_APPEND | \
BTRFS_INODE_NODUMP | \
BTRFS_INODE_NOATIME | \
BTRFS_INODE_DIRSYNC | \
BTRFS_INODE_COMPRESS | \
BTRFS_INODE_ROOT_ITEM_INIT )
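One plausible use of a mask like `BTRFS_INODE_FLAG_MASK` is rejecting flag words that carry unknown bits. The sketch below shows that check on a reduced, hypothetical set of `DEMO_INODE_*` flags. It does not claim to reproduce how the kernel consumes the mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A few hypothetical flag bits mirroring the layout above. */
#define DEMO_INODE_NODATASUM (1ULL << 0)
#define DEMO_INODE_NODATACOW (1ULL << 1)
#define DEMO_INODE_COMPRESS  (1ULL << 11)

#define DEMO_INODE_FLAG_MASK \
	(DEMO_INODE_NODATASUM | DEMO_INODE_NODATACOW | DEMO_INODE_COMPRESS)

/* True when no bit outside the known mask is set. */
static inline bool demo_inode_flags_valid(uint64_t flags)
{
	return (flags & ~DEMO_INODE_FLAG_MASK) == 0;
}

/* Drop unknown bits; the result always passes the validity check. */
static inline uint64_t demo_inode_flags_sanitize(uint64_t flags)
{
	uint64_t known = flags & DEMO_INODE_FLAG_MASK;

	assert(demo_inode_flags_valid(known));
	return known;
}
```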
2012-03-03 07:40:03 -05:00
struct btrfs_map_token {
2020-04-29 02:15:56 +02:00
	struct extent_buffer *eb;
2012-03-03 07:40:03 -05:00
	char *kaddr;
	unsigned long offset;
};
2016-01-21 15:55:53 +05:30
#define BTRFS_BYTES_TO_BLKS(fs_info, bytes) \
2020-07-01 21:19:09 +02:00
	((bytes) >> (fs_info)->sectorsize_bits)
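`BTRFS_BYTES_TO_BLKS` turns a byte count into a block count with a right shift, which relies on the sector size being a power of two. A small sketch of the same conversion follows, using a hypothetical `struct demo_blk_info` in place of `struct btrfs_fs_info`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two btrfs_fs_info fields involved. */
struct demo_blk_info {
	uint32_t sectorsize;      /* must be a power of two */
	uint32_t sectorsize_bits; /* log2(sectorsize) */
};

#define DEMO_BYTES_TO_BLKS(fs_info, bytes) \
	((bytes) >> (fs_info)->sectorsize_bits)

/* Convert a byte count to whole blocks; any partial block is dropped. */
static inline uint64_t demo_bytes_to_blks(const struct demo_blk_info *info,
					  uint64_t bytes)
{
	assert(info->sectorsize == (1U << info->sectorsize_bits));
	return DEMO_BYTES_TO_BLKS(info, bytes);
}
```

With 4096-byte sectors (`sectorsize_bits` of 12), 8192 bytes map to 2 blocks and 4095 bytes map to 0, so the shift rounds down.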
2016-01-21 15:55:53 +05:30
2019-08-09 17:48:21 +02:00
static inline void btrfs_init_map_token(struct btrfs_map_token *token,
					struct extent_buffer *eb)
2012-03-03 07:40:03 -05:00
{
2019-08-09 17:48:21 +02:00
	token->eb = eb;
2020-04-29 19:29:04 +02:00
	token->kaddr = page_address(eb->pages[0]);
	token->offset = 0;
2012-03-03 07:40:03 -05:00
}
2016-05-19 21:18:45 -04:00
/* some macros to generate set/get functions for the struct fields. This
2007-10-15 16:14:19 -04:00
 * assumes there is a lefoo_to_cpu for every type, so let's make a simple
 * one for u8:
 */
#define le8_to_cpu(v) (v)
#define cpu_to_le8(v) (v)
#define __le8 u8
2020-09-15 14:58:42 +02:00
static inline u8 get_unaligned_le8(const void *p)
{
	return *(u8 *)p;
}
static inline void put_unaligned_le8(u8 val, void *p)
{
	*(u8 *)p = val;
}
2016-09-20 10:05:01 -04:00
#define read_eb_member(eb, ptr, type, member, result) (\
2007-10-15 16:14:19 -04:00
	read_extent_buffer(eb, (char *)(result), \
			   ((unsigned long)(ptr)) + \
			   offsetof(type, member), \
			   sizeof(((type *)0)->member)))
2016-09-20 10:05:01 -04:00
#define write_eb_member(eb, ptr, type, member, result) (\
2007-10-15 16:14:19 -04:00
	write_extent_buffer(eb, (char *)(result), \
			    ((unsigned long)(ptr)) + \
			    offsetof(type, member), \
			    sizeof(((type *)0)->member)))
2012-07-09 20:22:35 -06:00
#define DECLARE_BTRFS_SETGET_BITS(bits) \
2020-04-29 02:15:56 +02:00
u##bits btrfs_get_token_##bits(struct btrfs_map_token *token, \
			       const void *ptr, unsigned long off); \
void btrfs_set_token_##bits(struct btrfs_map_token *token, \
			    const void *ptr, unsigned long off, \
			    u##bits val); \
2019-08-09 17:12:38 +02:00
u##bits btrfs_get_##bits(const struct extent_buffer *eb, \
			 const void *ptr, unsigned long off); \
2020-04-29 03:04:10 +02:00
void btrfs_set_##bits(const struct extent_buffer *eb, void *ptr, \
2019-08-09 17:12:38 +02:00
		      unsigned long off, u##bits val);
2012-07-09 20:22:35 -06:00
DECLARE_BTRFS_SETGET_BITS(8)
DECLARE_BTRFS_SETGET_BITS(16)
DECLARE_BTRFS_SETGET_BITS(32)
DECLARE_BTRFS_SETGET_BITS(64)
2007-10-15 16:14:19 -04:00
#define BTRFS_SETGET_FUNCS(name, type, member, bits) \
2017-06-28 21:56:53 -06:00
static inline u##bits btrfs_##name(const struct extent_buffer *eb, \
				   const type *s) \
2012-07-09 20:22:35 -06:00
{ \
	BUILD_BUG_ON(sizeof(u##bits) != sizeof(((type *)0))->member); \
	return btrfs_get_##bits(eb, s, offsetof(type, member)); \
} \
2020-04-29 03:04:10 +02:00
static inline void btrfs_set_##name(const struct extent_buffer *eb, type *s, \
2012-07-09 20:22:35 -06:00
				    u##bits val) \
{ \
	BUILD_BUG_ON(sizeof(u##bits) != sizeof(((type *)0))->member); \
	btrfs_set_##bits(eb, s, offsetof(type, member), val); \
} \
2020-04-29 02:15:56 +02:00
static inline u##bits btrfs_token_##name(struct btrfs_map_token *token, \
					 const type *s) \
2012-07-09 20:22:35 -06:00
{ \
	BUILD_BUG_ON(sizeof(u##bits) != sizeof(((type *)0))->member); \
2020-04-29 02:15:56 +02:00
	return btrfs_get_token_##bits(token, s, offsetof(type, member)); \
2012-07-09 20:22:35 -06:00
} \
2020-04-29 02:15:56 +02:00
static inline void btrfs_set_token_##name(struct btrfs_map_token *token, \
					  type *s, u##bits val) \
2012-07-09 20:22:35 -06:00
{ \
	BUILD_BUG_ON(sizeof(u##bits) != sizeof(((type *)0))->member); \
2020-04-29 02:15:56 +02:00
	btrfs_set_token_##bits(token, s, offsetof(type, member), val); \
2012-07-09 20:22:35 -06:00
}
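`BTRFS_SETGET_FUNCS` generates one typed accessor per structure field: it checks the field width at compile time and then reads at `offsetof(type, member)` relative to the item. The sketch below reproduces that shape in userspace for a getter only. All `demo_*`/`DEMO_*` names are hypothetical, and the buffer is a flat byte array addressed by a byte offset. How the real `btrfs_get_##bits()` resolves its `ptr` argument against an extent buffer is not shown in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BUF_SIZE 64

/* Hypothetical stand-in for an extent buffer: a flat byte array. */
struct demo_buf {
	uint8_t data[DEMO_BUF_SIZE];
};

/* Hypothetical on-disk layout (packed is a GCC/Clang extension). */
struct demo_ref {
	uint8_t type;
	uint16_t name_len;
} __attribute__((packed));

/* Read a little-endian u16 located 'off' bytes into the item at 'base'. */
static inline uint16_t demo_get_16(const struct demo_buf *eb, size_t base,
				   size_t off)
{
	const uint8_t *p;

	assert(base + off + sizeof(uint16_t) <= DEMO_BUF_SIZE);
	p = eb->data + base + off;
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Generate a typed getter that checks the field width at compile time. */
#define DEMO_GET_FUNC(name, type, member, bits)				\
static inline uint##bits##_t demo_##name(const struct demo_buf *eb,	\
					 size_t base)			\
{									\
	_Static_assert(sizeof(uint##bits##_t) ==			\
		       sizeof(((type *)0)->member), "width mismatch");	\
	return demo_get_##bits(eb, base, offsetof(type, member));	\
}

DEMO_GET_FUNC(ref_name_len, struct demo_ref, name_len, 16)
```

Declaring the accessor with the wrong width (for example 32 for `name_len`) fails to compile in this sketch, which is the job `BUILD_BUG_ON` does in the macro above.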
2007-10-15 16:14:19 -04:00
#define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits) \
2017-06-28 21:56:53 -06:00
static inline u##bits btrfs_##name(const struct extent_buffer *eb) \
2007-10-15 16:14:19 -04:00
{ \
2020-12-02 14:48:04 +08:00
	const type *p = page_address(eb->pages[0]) + \
			offset_in_page(eb->start); \
2020-09-15 14:58:42 +02:00
	return get_unaligned_le##bits(&p->member); \
2007-10-15 16:14:19 -04:00
} \
2020-04-29 03:04:10 +02:00
static inline void btrfs_set_##name(const struct extent_buffer *eb, \
2007-10-15 16:14:19 -04:00
				    u##bits val) \
{ \
2020-12-02 14:48:04 +08:00
	type *p = page_address(eb->pages[0]) + offset_in_page(eb->start); \
2020-09-15 14:58:42 +02:00
	put_unaligned_le##bits(val, &p->member); \
2007-10-15 16:14:19 -04:00
}
2007-04-26 16:46:15 -04:00
2007-10-15 16:14:19 -04:00
#define BTRFS_SETGET_STACK_FUNCS(name, type, member, bits) \
2017-06-28 21:56:53 -06:00
static inline u##bits btrfs_##name(const type *s) \
2007-10-15 16:14:19 -04:00
{ \
2020-09-15 14:58:42 +02:00
	return get_unaligned_le##bits(&s->member); \
2007-10-15 16:14:19 -04:00
} \
static inline void btrfs_set_##name(type *s, u##bits val) \
{ \
2020-09-15 14:58:42 +02:00
	put_unaligned_le##bits(val, &s->member); \
2007-03-15 19:03:33 -04:00
}
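The stack accessors read and write a structure held in ordinary memory, converting between CPU byte order and the little-endian on-disk order without assuming alignment. A minimal userspace sketch of that pattern follows. The `demo_*` helpers and `struct demo_item` are hypothetical, and the byte-wise helpers are simple stand-ins for `get_unaligned_le32()` and `put_unaligned_le32()`.

```c
#include <assert.h>
#include <stdint.h>

/* Byte-wise little-endian helpers; safe for unaligned addresses. */
static inline uint32_t demo_get_le32(const void *p)
{
	const uint8_t *b = p;

	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

static inline void demo_put_le32(uint32_t v, void *p)
{
	uint8_t *b = p;

	b[0] = v & 0xff;
	b[1] = (v >> 8) & 0xff;
	b[2] = (v >> 16) & 0xff;
	b[3] = (v >> 24) & 0xff;
}

/* Hypothetical on-disk record; nlink sits at an odd offset on purpose. */
struct demo_item {
	uint8_t type;
	uint8_t nlink[4]; /* raw little-endian bytes */
};

/* Generate a getter/setter pair for one field of a given bit width. */
#define DEMO_SETGET_STACK_FUNCS(name, type, member, bits)		\
static inline uint##bits##_t demo_##name(const type *s)		\
{									\
	return demo_get_le##bits(&s->member);				\
}									\
static inline void demo_set_##name(type *s, uint##bits##_t val)	\
{									\
	demo_put_le##bits(val, &s->member);				\
}

DEMO_SETGET_STACK_FUNCS(item_nlink, struct demo_item, nlink, 32)

/* Store a value and read it back through the generated accessors. */
static inline uint32_t demo_roundtrip(uint32_t val)
{
	struct demo_item item = { 0 };

	demo_set_item_nlink(&item, val);
	assert(item.nlink[0] == (val & 0xff)); /* low byte is stored first */
	return demo_item_nlink(&item);
}
```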
2020-04-29 03:04:10 +02:00
static inline u64 btrfs_device_total_bytes(const struct extent_buffer *eb,
2017-06-16 14:39:19 +03:00
					   struct btrfs_dev_item *s)
{
	BUILD_BUG_ON(sizeof(u64) !=
		     sizeof(((struct btrfs_dev_item *)0))->total_bytes);
	return btrfs_get_64(eb, s, offsetof(struct btrfs_dev_item,
					    total_bytes));
}
2020-04-29 03:04:10 +02:00
static inline void btrfs_set_device_total_bytes(const struct extent_buffer *eb,
2017-06-16 14:39:19 +03:00
						struct btrfs_dev_item *s,
						u64 val)
{
	BUILD_BUG_ON(sizeof(u64) !=
		     sizeof(((struct btrfs_dev_item *)0))->total_bytes);
2017-06-16 14:39:20 +03:00
	WARN_ON(!IS_ALIGNED(val, eb->fs_info->sectorsize));
2017-06-16 14:39:19 +03:00
	btrfs_set_64(eb, s, offsetof(struct btrfs_dev_item, total_bytes), val);
}
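The setter above warns when `total_bytes` is not a multiple of the sector size. The sketch below shows an equivalent alignment test and one way a caller could round a value down first. `DEMO_IS_ALIGNED` and `demo_round_down` are hypothetical names, and both assume a power-of-two sector size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same idea as the kernel's IS_ALIGNED(); 'a' must be a power of two. */
#define DEMO_IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

/* Round a byte count down to a multiple of the sector size. */
static inline uint64_t demo_round_down(uint64_t val, uint32_t sectorsize)
{
	/* The mask trick only works for power-of-two sizes. */
	assert(sectorsize != 0 && (sectorsize & (sectorsize - 1)) == 0);
	return val & ~((uint64_t)sectorsize - 1);
}
```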
2008-03-24 15:01:56 -04:00
BTRFS_SETGET_FUNCS ( device_type , struct btrfs_dev_item , type , 64 ) ;
BTRFS_SETGET_FUNCS ( device_bytes_used , struct btrfs_dev_item , bytes_used , 64 ) ;
BTRFS_SETGET_FUNCS ( device_io_align , struct btrfs_dev_item , io_align , 32 ) ;
BTRFS_SETGET_FUNCS ( device_io_width , struct btrfs_dev_item , io_width , 32 ) ;
2008-12-08 16:40:21 -05:00
BTRFS_SETGET_FUNCS ( device_start_offset , struct btrfs_dev_item ,
start_offset , 64 ) ;
2008-03-24 15:01:56 -04:00
BTRFS_SETGET_FUNCS ( device_sector_size , struct btrfs_dev_item , sector_size , 32 ) ;
BTRFS_SETGET_FUNCS ( device_id , struct btrfs_dev_item , devid , 64 ) ;
2008-04-15 15:41:47 -04:00
BTRFS_SETGET_FUNCS ( device_group , struct btrfs_dev_item , dev_group , 32 ) ;
BTRFS_SETGET_FUNCS ( device_seek_speed , struct btrfs_dev_item , seek_speed , 8 ) ;
BTRFS_SETGET_FUNCS ( device_bandwidth , struct btrfs_dev_item , bandwidth , 8 ) ;
2008-11-17 21:11:30 -05:00
BTRFS_SETGET_FUNCS ( device_generation , struct btrfs_dev_item , generation , 64 ) ;
2008-03-24 15:01:56 -04:00
2008-03-24 15:02:07 -04:00
BTRFS_SETGET_STACK_FUNCS ( stack_device_type , struct btrfs_dev_item , type , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_device_total_bytes , struct btrfs_dev_item ,
total_bytes , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_device_bytes_used , struct btrfs_dev_item ,
bytes_used , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_device_io_align , struct btrfs_dev_item ,
io_align , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_device_io_width , struct btrfs_dev_item ,
io_width , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_device_sector_size , struct btrfs_dev_item ,
sector_size , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_device_id , struct btrfs_dev_item , devid , 64 ) ;
2008-04-15 15:41:47 -04:00
BTRFS_SETGET_STACK_FUNCS ( stack_device_group , struct btrfs_dev_item ,
dev_group , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_device_seek_speed , struct btrfs_dev_item ,
seek_speed , 8 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_device_bandwidth , struct btrfs_dev_item ,
bandwidth , 8 ) ;
2008-11-17 21:11:30 -05:00
BTRFS_SETGET_STACK_FUNCS ( stack_device_generation , struct btrfs_dev_item ,
generation , 64 ) ;
2008-03-24 15:02:07 -04:00
2013-08-20 13:20:11 +02:00
static inline unsigned long btrfs_device_uuid ( struct btrfs_dev_item * d )
2008-03-24 15:01:56 -04:00
{
2013-08-20 13:20:11 +02:00
return ( unsigned long ) d + offsetof ( struct btrfs_dev_item , uuid ) ;
2008-03-24 15:01:56 -04:00
}
2013-08-20 13:20:12 +02:00
static inline unsigned long btrfs_device_fsid ( struct btrfs_dev_item * d )
2008-11-17 21:11:30 -05:00
{
2013-08-20 13:20:12 +02:00
return ( unsigned long ) d + offsetof ( struct btrfs_dev_item , fsid ) ;
2008-11-17 21:11:30 -05:00
}
2008-04-15 15:41:47 -04:00
BTRFS_SETGET_FUNCS ( chunk_length , struct btrfs_chunk , length , 64 ) ;
2008-03-24 15:01:56 -04:00
BTRFS_SETGET_FUNCS ( chunk_owner , struct btrfs_chunk , owner , 64 ) ;
BTRFS_SETGET_FUNCS ( chunk_stripe_len , struct btrfs_chunk , stripe_len , 64 ) ;
BTRFS_SETGET_FUNCS ( chunk_io_align , struct btrfs_chunk , io_align , 32 ) ;
BTRFS_SETGET_FUNCS ( chunk_io_width , struct btrfs_chunk , io_width , 32 ) ;
BTRFS_SETGET_FUNCS ( chunk_sector_size , struct btrfs_chunk , sector_size , 32 ) ;
BTRFS_SETGET_FUNCS ( chunk_type , struct btrfs_chunk , type , 64 ) ;
BTRFS_SETGET_FUNCS ( chunk_num_stripes , struct btrfs_chunk , num_stripes , 16 ) ;
2008-04-16 10:49:51 -04:00
BTRFS_SETGET_FUNCS ( chunk_sub_stripes , struct btrfs_chunk , sub_stripes , 16 ) ;
2008-03-24 15:01:56 -04:00
BTRFS_SETGET_FUNCS ( stripe_devid , struct btrfs_stripe , devid , 64 ) ;
BTRFS_SETGET_FUNCS ( stripe_offset , struct btrfs_stripe , offset , 64 ) ;
2008-04-15 15:41:47 -04:00
static inline char * btrfs_stripe_dev_uuid ( struct btrfs_stripe * s )
{
return ( char * ) s + offsetof ( struct btrfs_stripe , dev_uuid ) ;
}
BTRFS_SETGET_STACK_FUNCS ( stack_chunk_length , struct btrfs_chunk , length , 64 ) ;
2008-03-24 15:01:56 -04:00
BTRFS_SETGET_STACK_FUNCS ( stack_chunk_owner , struct btrfs_chunk , owner , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_chunk_stripe_len , struct btrfs_chunk ,
stripe_len , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_chunk_io_align , struct btrfs_chunk ,
io_align , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_chunk_io_width , struct btrfs_chunk ,
io_width , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_chunk_sector_size , struct btrfs_chunk ,
sector_size , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_chunk_type , struct btrfs_chunk , type , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_chunk_num_stripes , struct btrfs_chunk ,
num_stripes , 16 ) ;
2008-04-16 10:49:51 -04:00
BTRFS_SETGET_STACK_FUNCS ( stack_chunk_sub_stripes , struct btrfs_chunk ,
sub_stripes , 16 ) ;
2008-03-24 15:01:56 -04:00
BTRFS_SETGET_STACK_FUNCS ( stack_stripe_devid , struct btrfs_stripe , devid , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_stripe_offset , struct btrfs_stripe , offset , 64 ) ;
static inline struct btrfs_stripe *btrfs_stripe_nr(struct btrfs_chunk *c,
						   int nr)
{
	unsigned long offset = (unsigned long)c;
	offset += offsetof(struct btrfs_chunk, stripe);
	offset += nr * sizeof(struct btrfs_stripe);
	return (struct btrfs_stripe *)offset;
}
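`btrfs_stripe_nr()` locates the nr-th stripe by adding the offset of the first stripe and `nr` stripe sizes to the chunk address. The sketch below repeats that arithmetic on reduced, hypothetical `demo_*` structures (packed via a GCC/Clang attribute). It only computes addresses and sizes, and the field lists are not the real btrfs layouts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, much-reduced layouts; only the shape matters here. */
struct demo_stripe {
	uint64_t devid;
	uint64_t offset;
} __attribute__((packed));

struct demo_chunk {
	uint64_t length;
	uint16_t num_stripes;
	struct demo_stripe stripe; /* additional stripes follow in memory */
} __attribute__((packed));

/* Address of stripe 'nr', counted from the embedded first stripe. */
static inline struct demo_stripe *demo_stripe_nr(struct demo_chunk *c, int nr)
{
	uintptr_t addr = (uintptr_t)c;

	assert(nr >= 0);
	addr += offsetof(struct demo_chunk, stripe);
	addr += (size_t)nr * sizeof(struct demo_stripe);
	return (struct demo_stripe *)addr;
}

/* Total bytes of a chunk item that carries 'num_stripes' stripes. */
static inline size_t demo_chunk_item_size(int num_stripes)
{
	assert(num_stripes >= 1);
	return sizeof(struct demo_chunk) +
	       (size_t)(num_stripes - 1) * sizeof(struct demo_stripe);
}
```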
2008-04-18 10:29:38 -04:00
static inline char * btrfs_stripe_dev_uuid_nr ( struct btrfs_chunk * c , int nr )
{
return btrfs_stripe_dev_uuid ( btrfs_stripe_nr ( c , nr ) ) ;
}
2020-04-29 03:04:10 +02:00
static inline u64 btrfs_stripe_offset_nr ( const struct extent_buffer * eb ,
2008-03-24 15:01:56 -04:00
struct btrfs_chunk * c , int nr )
{
return btrfs_stripe_offset ( eb , btrfs_stripe_nr ( c , nr ) ) ;
}
2020-04-29 03:04:10 +02:00
static inline u64 btrfs_stripe_devid_nr ( const struct extent_buffer * eb ,
2008-03-24 15:01:56 -04:00
struct btrfs_chunk * c , int nr )
{
return btrfs_stripe_devid ( eb , btrfs_stripe_nr ( c , nr ) ) ;
}
2007-10-15 16:14:19 -04:00
/* struct btrfs_block_group_item */
2019-10-23 18:48:18 +02:00
BTRFS_SETGET_STACK_FUNCS ( stack_block_group_used , struct btrfs_block_group_item ,
2007-10-15 16:14:19 -04:00
used , 64 ) ;
2019-10-23 18:48:20 +02:00
BTRFS_SETGET_FUNCS ( block_group_used , struct btrfs_block_group_item ,
2007-10-15 16:14:19 -04:00
used , 64 ) ;
2019-10-23 18:48:18 +02:00
BTRFS_SETGET_STACK_FUNCS ( stack_block_group_chunk_objectid ,
2008-03-24 15:01:56 -04:00
struct btrfs_block_group_item , chunk_objectid , 64 ) ;
2008-04-15 15:41:47 -04:00
2019-10-23 18:48:20 +02:00
BTRFS_SETGET_FUNCS ( block_group_chunk_objectid ,
2008-03-24 15:01:56 -04:00
struct btrfs_block_group_item , chunk_objectid , 64 ) ;
2019-10-23 18:48:20 +02:00
BTRFS_SETGET_FUNCS ( block_group_flags ,
2008-03-24 15:01:56 -04:00
struct btrfs_block_group_item , flags , 64 ) ;
2019-10-23 18:48:18 +02:00
BTRFS_SETGET_STACK_FUNCS ( stack_block_group_flags ,
2008-03-24 15:01:56 -04:00
struct btrfs_block_group_item , flags , 64 ) ;
2007-03-15 19:03:33 -04:00
2015-09-29 20:50:34 -07:00
/* struct btrfs_free_space_info */
BTRFS_SETGET_FUNCS ( free_space_extent_count , struct btrfs_free_space_info ,
extent_count , 32 ) ;
BTRFS_SETGET_FUNCS ( free_space_flags , struct btrfs_free_space_info , flags , 32 ) ;
2007-12-12 14:38:19 -05:00
/* struct btrfs_inode_ref */
BTRFS_SETGET_FUNCS ( inode_ref_name_len , struct btrfs_inode_ref , name_len , 16 ) ;
2008-07-24 12:12:38 -04:00
BTRFS_SETGET_FUNCS ( inode_ref_index , struct btrfs_inode_ref , index , 64 ) ;
2007-12-12 14:38:19 -05:00
2012-08-08 11:32:27 -07:00
/* struct btrfs_inode_extref */
BTRFS_SETGET_FUNCS ( inode_extref_parent , struct btrfs_inode_extref ,
parent_objectid , 64 ) ;
BTRFS_SETGET_FUNCS ( inode_extref_name_len , struct btrfs_inode_extref ,
name_len , 16 ) ;
BTRFS_SETGET_FUNCS ( inode_extref_index , struct btrfs_inode_extref , index , 64 ) ;
2007-10-15 16:14:19 -04:00
/* struct btrfs_inode_item */
BTRFS_SETGET_FUNCS ( inode_generation , struct btrfs_inode_item , generation , 64 ) ;
2008-12-08 16:40:21 -05:00
BTRFS_SETGET_FUNCS ( inode_sequence , struct btrfs_inode_item , sequence , 64 ) ;
2008-09-05 16:13:11 -04:00
BTRFS_SETGET_FUNCS ( inode_transid , struct btrfs_inode_item , transid , 64 ) ;
2007-10-15 16:14:19 -04:00
BTRFS_SETGET_FUNCS ( inode_size , struct btrfs_inode_item , size , 64 ) ;
2008-10-09 11:46:29 -04:00
BTRFS_SETGET_FUNCS ( inode_nbytes , struct btrfs_inode_item , nbytes , 64 ) ;
2007-10-15 16:14:19 -04:00
BTRFS_SETGET_FUNCS ( inode_block_group , struct btrfs_inode_item , block_group , 64 ) ;
BTRFS_SETGET_FUNCS ( inode_nlink , struct btrfs_inode_item , nlink , 32 ) ;
BTRFS_SETGET_FUNCS ( inode_uid , struct btrfs_inode_item , uid , 32 ) ;
BTRFS_SETGET_FUNCS ( inode_gid , struct btrfs_inode_item , gid , 32 ) ;
BTRFS_SETGET_FUNCS ( inode_mode , struct btrfs_inode_item , mode , 32 ) ;
2008-03-24 15:01:56 -04:00
BTRFS_SETGET_FUNCS ( inode_rdev , struct btrfs_inode_item , rdev , 64 ) ;
2008-12-02 06:36:08 -05:00
BTRFS_SETGET_FUNCS ( inode_flags , struct btrfs_inode_item , flags , 64 ) ;
2013-07-16 11:19:18 +08:00
BTRFS_SETGET_STACK_FUNCS ( stack_inode_generation , struct btrfs_inode_item ,
generation , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_sequence , struct btrfs_inode_item ,
sequence , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_transid , struct btrfs_inode_item ,
transid , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_size , struct btrfs_inode_item , size , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_nbytes , struct btrfs_inode_item ,
nbytes , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_block_group , struct btrfs_inode_item ,
block_group , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_nlink , struct btrfs_inode_item , nlink , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_uid , struct btrfs_inode_item , uid , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_gid , struct btrfs_inode_item , gid , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_mode , struct btrfs_inode_item , mode , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_rdev , struct btrfs_inode_item , rdev , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_inode_flags , struct btrfs_inode_item , flags , 64 ) ;
2008-03-24 15:01:56 -04:00
BTRFS_SETGET_FUNCS ( timespec_sec , struct btrfs_timespec , sec , 64 ) ;
BTRFS_SETGET_FUNCS ( timespec_nsec , struct btrfs_timespec , nsec , 32 ) ;
2013-07-16 11:19:18 +08:00
BTRFS_SETGET_STACK_FUNCS ( stack_timespec_sec , struct btrfs_timespec , sec , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_timespec_nsec , struct btrfs_timespec , nsec , 32 ) ;
2007-03-22 12:13:20 -04:00
2008-03-24 15:01:56 -04:00
/* struct btrfs_dev_extent */
2008-04-15 15:41:47 -04:00
BTRFS_SETGET_FUNCS ( dev_extent_chunk_tree , struct btrfs_dev_extent ,
chunk_tree , 64 ) ;
BTRFS_SETGET_FUNCS ( dev_extent_chunk_objectid , struct btrfs_dev_extent ,
chunk_objectid , 64 ) ;
BTRFS_SETGET_FUNCS ( dev_extent_chunk_offset , struct btrfs_dev_extent ,
chunk_offset , 64 ) ;
2008-03-24 15:01:56 -04:00
BTRFS_SETGET_FUNCS ( dev_extent_length , struct btrfs_dev_extent , length , 64 ) ;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
BTRFS_SETGET_FUNCS ( extent_refs , struct btrfs_extent_item , refs , 64 ) ;
BTRFS_SETGET_FUNCS ( extent_generation , struct btrfs_extent_item ,
generation , 64 ) ;
BTRFS_SETGET_FUNCS ( extent_flags , struct btrfs_extent_item , flags , 64 ) ;
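The copy-on-write reference rule described in the commit message above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: `toy_block`, `toy_cow` and `TOY_MAX_PTRS` are hypothetical names, and the model ignores back refs, transactions and on-disk layout entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model, not the kernel's data structures. */
#define TOY_MAX_PTRS 4

struct toy_block {
	int refs;                              /* reference count of this block */
	size_t nr;                             /* number of child pointers in use */
	struct toy_block *child[TOY_MAX_PTRS]; /* extents this block points to */
};

/*
 * Copy-on-write 'old' into 'new_blk'.
 *
 * Non-shared block (refs == 1): the old block is freed at once and the new
 * block inherits its references, so no child count changes.
 * Shared block (refs > 1): every extent the new block points to gains one
 * reference, and the old block loses one.
 */
static void toy_cow(struct toy_block *old, struct toy_block *new_blk)
{
	size_t i;

	new_blk->refs = 1;
	new_blk->nr = old->nr;
	for (i = 0; i < old->nr; i++)
		new_blk->child[i] = old->child[i];

	if (old->refs == 1) {
		old->refs = 0; /* freed immediately */
		old->nr = 0;   /* its references now belong to new_blk */
		return;
	}

	for (i = 0; i < new_blk->nr; i++)
		new_blk->child[i]->refs++;
	old->refs--;
}
```

In the non-shared case the child counts are never touched, which is the saving the commit message describes: there are no increments during cow that a later walk has to undo.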
2007-12-11 09:25:06 -05:00
2009-06-10 10:45:14 -04:00
BTRFS_SETGET_FUNCS ( tree_block_level , struct btrfs_tree_block_info , level , 8 ) ;
2020-04-29 03:04:10 +02:00
static inline void btrfs_tree_block_key ( const struct extent_buffer * eb ,
2009-06-10 10:45:14 -04:00
struct btrfs_tree_block_info * item ,
struct btrfs_disk_key * key )
{
read_eb_member ( eb , item , struct btrfs_tree_block_info , key , key ) ;
}
2020-04-29 03:04:10 +02:00
static inline void btrfs_set_tree_block_key ( const struct extent_buffer * eb ,
2009-06-10 10:45:14 -04:00
struct btrfs_tree_block_info * item ,
struct btrfs_disk_key * key )
{
write_eb_member ( eb , item , struct btrfs_tree_block_info , key , key ) ;
}
2007-03-22 12:13:20 -04:00
2009-06-10 10:45:14 -04:00
BTRFS_SETGET_FUNCS ( extent_data_ref_root , struct btrfs_extent_data_ref ,
root , 64 ) ;
BTRFS_SETGET_FUNCS ( extent_data_ref_objectid , struct btrfs_extent_data_ref ,
objectid , 64 ) ;
BTRFS_SETGET_FUNCS ( extent_data_ref_offset , struct btrfs_extent_data_ref ,
offset , 64 ) ;
BTRFS_SETGET_FUNCS ( extent_data_ref_count , struct btrfs_extent_data_ref ,
count , 32 ) ;
BTRFS_SETGET_FUNCS ( shared_data_ref_count , struct btrfs_shared_data_ref ,
count , 32 ) ;
BTRFS_SETGET_FUNCS ( extent_inline_ref_type , struct btrfs_extent_inline_ref ,
type , 8 ) ;
BTRFS_SETGET_FUNCS ( extent_inline_ref_offset , struct btrfs_extent_inline_ref ,
offset , 64 ) ;
static inline u32 btrfs_extent_inline_ref_size ( int type )
{
if ( type == BTRFS_TREE_BLOCK_REF_KEY ||
type == BTRFS_SHARED_BLOCK_REF_KEY )
return sizeof ( struct btrfs_extent_inline_ref ) ;
if ( type == BTRFS_SHARED_DATA_REF_KEY )
return sizeof ( struct btrfs_shared_data_ref ) +
sizeof ( struct btrfs_extent_inline_ref ) ;
if ( type == BTRFS_EXTENT_DATA_REF_KEY )
return sizeof ( struct btrfs_extent_data_ref ) +
offsetof ( struct btrfs_extent_inline_ref , offset ) ;
return 0 ;
}
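As a rough illustration of what btrfs_extent_inline_ref_size() computes, the sketch below mirrors its logic with stand-in structs. The `toy_` layouts and the numeric key values are assumptions modelled on the on-disk format and are not defined in this header; `__attribute__((packed))` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed key type values (hypothetical stand-ins for the BTRFS_*_KEY macros). */
enum {
	TOY_TREE_BLOCK_REF_KEY   = 176,
	TOY_EXTENT_DATA_REF_KEY  = 178,
	TOY_SHARED_BLOCK_REF_KEY = 182,
	TOY_SHARED_DATA_REF_KEY  = 184,
};

/* Assumed packed layouts of the three reference structures. */
struct toy_extent_data_ref {
	uint64_t root;
	uint64_t objectid;
	uint64_t offset;
	uint32_t count;
} __attribute__((packed));

struct toy_shared_data_ref {
	uint32_t count;
} __attribute__((packed));

struct toy_extent_inline_ref {
	uint8_t type;
	uint64_t offset;
} __attribute__((packed));

/* Same branch structure as btrfs_extent_inline_ref_size(). */
static uint32_t toy_extent_inline_ref_size(int type)
{
	if (type == TOY_TREE_BLOCK_REF_KEY || type == TOY_SHARED_BLOCK_REF_KEY)
		return sizeof(struct toy_extent_inline_ref);
	if (type == TOY_SHARED_DATA_REF_KEY)
		return sizeof(struct toy_shared_data_ref) +
		       sizeof(struct toy_extent_inline_ref);
	if (type == TOY_EXTENT_DATA_REF_KEY)
		/* The data ref starts where the 'offset' field would be, so
		 * only the bytes before 'offset' are added. */
		return sizeof(struct toy_extent_data_ref) +
		       offsetof(struct toy_extent_inline_ref, offset);
	return 0; /* unknown reference type */
}
```

Under these assumed layouts the tree block and shared block cases come to 9 bytes, the shared data case to 13, and the extent data case to 29.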
2007-10-15 16:14:19 -04:00
/* struct btrfs_node */
BTRFS_SETGET_FUNCS ( key_blockptr , struct btrfs_key_ptr , blockptr , 64 ) ;
2007-12-11 09:25:06 -05:00
BTRFS_SETGET_FUNCS ( key_generation , struct btrfs_key_ptr , generation , 64 ) ;
2013-07-16 11:19:18 +08:00
BTRFS_SETGET_STACK_FUNCS ( stack_key_blockptr , struct btrfs_key_ptr ,
blockptr , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_key_generation , struct btrfs_key_ptr ,
generation , 64 ) ;
2007-03-22 12:13:20 -04:00
2020-04-29 03:04:10 +02:00
static inline u64 btrfs_node_blockptr ( const struct extent_buffer * eb , int nr )
2007-03-13 09:49:06 -04:00
{
2007-10-15 16:14:19 -04:00
unsigned long ptr ;
ptr = offsetof ( struct btrfs_node , ptrs ) +
sizeof ( struct btrfs_key_ptr ) * nr ;
return btrfs_key_blockptr ( eb , ( struct btrfs_key_ptr * ) ptr ) ;
2007-03-13 09:49:06 -04:00
}
2020-04-29 03:04:10 +02:00
static inline void btrfs_set_node_blockptr ( const struct extent_buffer * eb ,
2007-10-15 16:14:19 -04:00
int nr , u64 val )
2007-03-13 09:49:06 -04:00
{
2007-10-15 16:14:19 -04:00
unsigned long ptr ;
ptr = offsetof ( struct btrfs_node , ptrs ) +
sizeof ( struct btrfs_key_ptr ) * nr ;
btrfs_set_key_blockptr ( eb , ( struct btrfs_key_ptr * ) ptr , val ) ;
2007-03-13 09:49:06 -04:00
}
2020-04-29 03:04:10 +02:00
static inline u64 btrfs_node_ptr_generation ( const struct extent_buffer * eb , int nr )
2007-12-11 09:25:06 -05:00
{
unsigned long ptr ;
ptr = offsetof ( struct btrfs_node , ptrs ) +
sizeof ( struct btrfs_key_ptr ) * nr ;
return btrfs_key_generation ( eb , ( struct btrfs_key_ptr * ) ptr ) ;
}
2020-04-29 03:04:10 +02:00
static inline void btrfs_set_node_ptr_generation ( const struct extent_buffer * eb ,
2007-12-11 09:25:06 -05:00
int nr , u64 val )
{
unsigned long ptr ;
ptr = offsetof ( struct btrfs_node , ptrs ) +
sizeof ( struct btrfs_key_ptr ) * nr ;
btrfs_set_key_generation ( eb , ( struct btrfs_key_ptr * ) ptr , val ) ;
}
2007-10-15 16:18:55 -04:00
static inline unsigned long btrfs_node_key_ptr_offset ( int nr )
2007-04-20 20:23:12 -04:00
{
2007-10-15 16:14:19 -04:00
return offsetof ( struct btrfs_node , ptrs ) +
sizeof ( struct btrfs_key_ptr ) * nr ;
2007-04-20 20:23:12 -04:00
}
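The node accessors above treat a "pointer" as a byte offset into the tree block rather than a real address. A minimal sketch of that arithmetic follows, assuming a 101-byte block header and a 33-byte key pointer; the `toy_` types and the flat byte buffer standing in for an extent buffer are hypothetical, and `__attribute__((packed))` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the sizes are assumptions, not taken from this header. */
struct toy_disk_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
} __attribute__((packed));

struct toy_key_ptr {
	struct toy_disk_key key;
	uint64_t blockptr;
	uint64_t generation;
} __attribute__((packed));

struct toy_node {
	uint8_t header[101];       /* assumed size of the block header */
	struct toy_key_ptr ptrs[]; /* key pointers follow the header */
} __attribute__((packed));

/* Byte offset of key pointer number 'nr' inside a node block. */
static size_t toy_node_key_ptr_offset(int nr)
{
	return offsetof(struct toy_node, ptrs) +
	       sizeof(struct toy_key_ptr) * (size_t)nr;
}

/* Read a little-endian u64 at byte offset 'off' of 'buf'. */
static uint64_t toy_read_le64(const uint8_t *buf, size_t off)
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | buf[off + (size_t)i];
	return v;
}

/* Write 'v' as a little-endian u64 at byte offset 'off' of 'buf'. */
static void toy_write_le64(uint8_t *buf, size_t off, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		buf[off + (size_t)i] = (uint8_t)(v >> (8 * i));
}

static uint64_t toy_node_blockptr(const uint8_t *block, int nr)
{
	return toy_read_le64(block, toy_node_key_ptr_offset(nr) +
				    offsetof(struct toy_key_ptr, blockptr));
}

static void toy_set_node_blockptr(uint8_t *block, int nr, uint64_t val)
{
	toy_write_le64(block, toy_node_key_ptr_offset(nr) +
			      offsetof(struct toy_key_ptr, blockptr), val);
}
```

Keeping accessors in terms of offsets means the same helper works no matter where the block's bytes happen to live in memory.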
2017-06-28 21:56:53 -06:00
void btrfs_node_key ( const struct extent_buffer * eb ,
2007-11-06 15:09:29 -05:00
struct btrfs_disk_key * disk_key , int nr ) ;
2020-04-29 03:04:10 +02:00
static inline void btrfs_set_node_key ( const struct extent_buffer * eb ,
2007-10-15 16:14:19 -04:00
struct btrfs_disk_key * disk_key , int nr )
2007-03-13 09:28:32 -04:00
{
2007-10-15 16:14:19 -04:00
unsigned long ptr ;
ptr = btrfs_node_key_ptr_offset ( nr ) ;
write_eb_member ( eb , ( struct btrfs_key_ptr * ) ptr ,
struct btrfs_key_ptr , key , disk_key ) ;
2007-03-13 09:28:32 -04:00
}
2007-10-15 16:14:19 -04:00
/* struct btrfs_item */
BTRFS_SETGET_FUNCS ( item_offset , struct btrfs_item , offset , 32 ) ;
BTRFS_SETGET_FUNCS ( item_size , struct btrfs_item , size , 32 ) ;
2013-07-16 11:19:18 +08:00
BTRFS_SETGET_STACK_FUNCS ( stack_item_offset , struct btrfs_item , offset , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_item_size , struct btrfs_item , size , 32 ) ;
2007-04-20 20:23:12 -04:00
2007-10-15 16:14:19 -04:00
static inline unsigned long btrfs_item_nr_offset ( int nr )
2007-03-13 09:28:32 -04:00
{
2007-10-15 16:14:19 -04:00
return offsetof ( struct btrfs_leaf , items ) +
sizeof ( struct btrfs_item ) * nr ;
2007-03-13 09:28:32 -04:00
}
2013-09-16 15:58:09 +01:00
static inline struct btrfs_item * btrfs_item_nr ( int nr )
2007-03-12 20:12:07 -04:00
{
2007-10-15 16:14:19 -04:00
return ( struct btrfs_item * ) btrfs_item_nr_offset ( nr ) ;
2007-03-12 20:12:07 -04:00
}
2017-06-28 21:56:53 -06:00
static inline u32 btrfs_item_end ( const struct extent_buffer * eb ,
2007-10-15 16:14:19 -04:00
struct btrfs_item * item )
2007-03-12 20:12:07 -04:00
{
2007-10-15 16:14:19 -04:00
return btrfs_item_offset ( eb , item ) + btrfs_item_size ( eb , item ) ;
2007-03-12 20:12:07 -04:00
}
2017-06-28 21:56:53 -06:00
static inline u32 btrfs_item_end_nr ( const struct extent_buffer * eb , int nr )
2007-03-12 20:12:07 -04:00
{
2013-09-16 15:58:09 +01:00
return btrfs_item_end ( eb , btrfs_item_nr ( nr ) ) ;
2007-03-12 20:12:07 -04:00
}
2017-06-28 21:56:53 -06:00
static inline u32 btrfs_item_offset_nr ( const struct extent_buffer * eb , int nr )
2007-03-12 20:12:07 -04:00
{
2013-09-16 15:58:09 +01:00
return btrfs_item_offset ( eb , btrfs_item_nr ( nr ) ) ;
2007-03-12 20:12:07 -04:00
}
2017-06-28 21:56:53 -06:00
static inline u32 btrfs_item_size_nr ( const struct extent_buffer * eb , int nr )
2007-03-12 20:12:07 -04:00
{
2013-09-16 15:58:09 +01:00
return btrfs_item_size ( eb , btrfs_item_nr ( nr ) ) ;
2007-03-12 20:12:07 -04:00
}
2017-06-28 21:56:53 -06:00
static inline void btrfs_item_key ( const struct extent_buffer * eb ,
2007-10-15 16:14:19 -04:00
struct btrfs_disk_key * disk_key , int nr )
2007-03-15 15:18:43 -04:00
{
2013-09-16 15:58:09 +01:00
struct btrfs_item * item = btrfs_item_nr ( nr ) ;
2007-10-15 16:14:19 -04:00
read_eb_member ( eb , item , struct btrfs_item , key , disk_key ) ;
2007-03-15 15:18:43 -04:00
}
2007-10-15 16:14:19 -04:00
static inline void btrfs_set_item_key ( struct extent_buffer * eb ,
struct btrfs_disk_key * disk_key , int nr )
2007-03-15 15:18:43 -04:00
{
2013-09-16 15:58:09 +01:00
struct btrfs_item * item = btrfs_item_nr ( nr ) ;
2007-10-15 16:14:19 -04:00
write_eb_member ( eb , item , struct btrfs_item , key , disk_key ) ;
2007-03-15 15:18:43 -04:00
}
2008-09-05 16:13:11 -04:00
BTRFS_SETGET_FUNCS ( dir_log_end , struct btrfs_dir_log_item , end , 64 ) ;
2008-11-17 20:37:39 -05:00
/*
* struct btrfs_root_ref
*/
BTRFS_SETGET_FUNCS ( root_ref_dirid , struct btrfs_root_ref , dirid , 64 ) ;
BTRFS_SETGET_FUNCS ( root_ref_sequence , struct btrfs_root_ref , sequence , 64 ) ;
BTRFS_SETGET_FUNCS ( root_ref_name_len , struct btrfs_root_ref , name_len , 16 ) ;
2007-10-15 16:14:19 -04:00
/* struct btrfs_dir_item */
2007-11-16 11:45:54 -05:00
BTRFS_SETGET_FUNCS ( dir_data_len , struct btrfs_dir_item , data_len , 16 ) ;
2007-10-15 16:14:19 -04:00
BTRFS_SETGET_FUNCS ( dir_type , struct btrfs_dir_item , type , 8 ) ;
BTRFS_SETGET_FUNCS ( dir_name_len , struct btrfs_dir_item , name_len , 16 ) ;
2008-09-05 16:13:11 -04:00
BTRFS_SETGET_FUNCS ( dir_transid , struct btrfs_dir_item , transid , 64 ) ;
2013-07-16 11:19:18 +08:00
BTRFS_SETGET_STACK_FUNCS ( stack_dir_type , struct btrfs_dir_item , type , 8 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dir_data_len , struct btrfs_dir_item ,
data_len , 16 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dir_name_len , struct btrfs_dir_item ,
name_len , 16 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dir_transid , struct btrfs_dir_item ,
transid , 64 ) ;
2007-03-15 15:18:43 -04:00
2017-06-28 21:56:53 -06:00
static inline void btrfs_dir_item_key ( const struct extent_buffer * eb ,
const struct btrfs_dir_item * item ,
2007-10-15 16:14:19 -04:00
struct btrfs_disk_key * key )
2007-03-15 15:18:43 -04:00
{
2007-10-15 16:14:19 -04:00
read_eb_member ( eb , item , struct btrfs_dir_item , location , key ) ;
2007-03-15 15:18:43 -04:00
}
2007-10-15 16:14:19 -04:00
static inline void btrfs_set_dir_item_key ( struct extent_buffer * eb ,
struct btrfs_dir_item * item ,
2017-06-28 21:56:53 -06:00
const struct btrfs_disk_key * key )
2007-03-16 08:46:49 -04:00
{
2007-10-15 16:14:19 -04:00
write_eb_member ( eb , item , struct btrfs_dir_item , location , key ) ;
2007-03-16 08:46:49 -04:00
}
2010-06-21 14:48:16 -04:00
BTRFS_SETGET_FUNCS ( free_space_entries , struct btrfs_free_space_header ,
num_entries , 64 ) ;
BTRFS_SETGET_FUNCS ( free_space_bitmaps , struct btrfs_free_space_header ,
num_bitmaps , 64 ) ;
BTRFS_SETGET_FUNCS ( free_space_generation , struct btrfs_free_space_header ,
generation , 64 ) ;
2017-06-28 21:56:53 -06:00
static inline void btrfs_free_space_key ( const struct extent_buffer * eb ,
const struct btrfs_free_space_header * h ,
2010-06-21 14:48:16 -04:00
struct btrfs_disk_key * key )
{
read_eb_member ( eb , h , struct btrfs_free_space_header , location , key ) ;
}
static inline void btrfs_set_free_space_key ( struct extent_buffer * eb ,
struct btrfs_free_space_header * h ,
2017-06-28 21:56:53 -06:00
const struct btrfs_disk_key * key )
2010-06-21 14:48:16 -04:00
{
write_eb_member ( eb , h , struct btrfs_free_space_header , location , key ) ;
}
2007-10-15 16:14:19 -04:00
/* struct btrfs_disk_key */
BTRFS_SETGET_STACK_FUNCS ( disk_key_objectid , struct btrfs_disk_key ,
objectid , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( disk_key_offset , struct btrfs_disk_key , offset , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( disk_key_type , struct btrfs_disk_key , type , 8 ) ;
2007-03-15 15:18:43 -04:00
2020-06-08 16:06:07 +02:00
# ifdef __LITTLE_ENDIAN
/*
* Optimized helpers for little-endian architectures where CPU and on-disk
* structures have the same endianness and we can skip conversions.
*/
static inline void btrfs_disk_key_to_cpu ( struct btrfs_key * cpu_key ,
const struct btrfs_disk_key * disk_key )
{
memcpy ( cpu_key , disk_key , sizeof ( struct btrfs_key ) ) ;
}
static inline void btrfs_cpu_key_to_disk ( struct btrfs_disk_key * disk_key ,
const struct btrfs_key * cpu_key )
{
memcpy ( disk_key , cpu_key , sizeof ( struct btrfs_key ) ) ;
}
static inline void btrfs_node_key_to_cpu ( const struct extent_buffer * eb ,
struct btrfs_key * cpu_key , int nr )
{
struct btrfs_disk_key * disk_key = ( struct btrfs_disk_key * ) cpu_key ;
btrfs_node_key ( eb , disk_key , nr ) ;
}
static inline void btrfs_item_key_to_cpu ( const struct extent_buffer * eb ,
struct btrfs_key * cpu_key , int nr )
{
struct btrfs_disk_key * disk_key = ( struct btrfs_disk_key * ) cpu_key ;
btrfs_item_key ( eb , disk_key , nr ) ;
}
static inline void btrfs_dir_item_key_to_cpu ( const struct extent_buffer * eb ,
const struct btrfs_dir_item * item ,
struct btrfs_key * cpu_key )
{
struct btrfs_disk_key * disk_key = ( struct btrfs_disk_key * ) cpu_key ;
btrfs_dir_item_key ( eb , item , disk_key ) ;
}
# else
2007-03-12 16:22:34 -04:00
static inline void btrfs_disk_key_to_cpu ( struct btrfs_key * cpu ,
2017-01-17 23:24:37 -08:00
const struct btrfs_disk_key * disk )
2007-03-12 16:22:34 -04:00
{
cpu->offset = le64_to_cpu ( disk->offset ) ;
2007-10-15 16:14:19 -04:00
cpu->type = disk->type ;
2007-03-12 16:22:34 -04:00
cpu->objectid = le64_to_cpu ( disk->objectid ) ;
}
static inline void btrfs_cpu_key_to_disk ( struct btrfs_disk_key * disk ,
2017-01-17 23:24:37 -08:00
const struct btrfs_key * cpu )
2007-03-12 16:22:34 -04:00
{
disk->offset = cpu_to_le64 ( cpu->offset ) ;
2007-10-15 16:14:19 -04:00
disk->type = cpu->type ;
2007-03-12 16:22:34 -04:00
disk->objectid = cpu_to_le64 ( cpu->objectid ) ;
}
2017-06-28 21:56:53 -06:00
static inline void btrfs_node_key_to_cpu ( const struct extent_buffer * eb ,
struct btrfs_key * key , int nr )
2007-03-23 15:56:19 -04:00
{
2007-10-15 16:14:19 -04:00
struct btrfs_disk_key disk_key ;
btrfs_node_key ( eb , & disk_key , nr ) ;
btrfs_disk_key_to_cpu ( key , & disk_key ) ;
2007-03-23 15:56:19 -04:00
}
2017-06-28 21:56:53 -06:00
static inline void btrfs_item_key_to_cpu ( const struct extent_buffer * eb ,
struct btrfs_key * key , int nr )
2007-03-23 15:56:19 -04:00
{
2007-10-15 16:14:19 -04:00
struct btrfs_disk_key disk_key ;
btrfs_item_key ( eb , & disk_key , nr ) ;
btrfs_disk_key_to_cpu ( key , & disk_key ) ;
2007-03-23 15:56:19 -04:00
}
2017-06-28 21:56:53 -06:00
static inline void btrfs_dir_item_key_to_cpu ( const struct extent_buffer * eb ,
const struct btrfs_dir_item * item ,
struct btrfs_key * key )
2007-04-20 20:23:12 -04:00
{
2007-10-15 16:14:19 -04:00
struct btrfs_disk_key disk_key ;
btrfs_dir_item_key ( eb , item , & disk_key ) ;
btrfs_disk_key_to_cpu ( key , & disk_key ) ;
2007-04-20 20:23:12 -04:00
}
2020-06-08 16:06:07 +02:00
# endif
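The two branches above do the same job: the little-endian one copies bytes because the layouts match, the generic one converts field by field. A portable sketch of the field-by-field path is shown here, assuming the disk key is 17 bytes laid out as objectid (le64), type (u8), offset (le64). The `toy_` names are hypothetical, and the byte-wise decode stands in for le64_to_cpu() and cpu_to_le64().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed on-disk layout: objectid (le64), type (u8), offset (le64). */
struct toy_disk_key {
	uint8_t bytes[17];
};

/* CPU-order key; hypothetical stand-in for struct btrfs_key. */
struct toy_cpu_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Decode a little-endian u64; gives the same result on any host. */
static uint64_t toy_get_le64(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Encode a u64 as little-endian bytes. */
static void toy_put_le64(uint8_t *p, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

static void toy_disk_key_to_cpu(struct toy_cpu_key *cpu,
				const struct toy_disk_key *disk)
{
	cpu->objectid = toy_get_le64(disk->bytes);
	cpu->type = disk->bytes[8];
	cpu->offset = toy_get_le64(disk->bytes + 9);
}

static void toy_cpu_key_to_disk(struct toy_disk_key *disk,
				const struct toy_cpu_key *cpu)
{
	toy_put_le64(disk->bytes, cpu->objectid);
	disk->bytes[8] = cpu->type;
	toy_put_le64(disk->bytes + 9, cpu->offset);
}
```

On a little-endian host with a packed CPU key, this conversion is byte-for-byte the identity, which is presumably why the optimized branch can use memcpy() or a plain cast.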
2007-10-15 16:14:19 -04:00
/* struct btrfs_header */
2007-10-15 16:15:53 -04:00
BTRFS_SETGET_HEADER_FUNCS ( header_bytenr , struct btrfs_header , bytenr , 64 ) ;
2007-10-15 16:14:19 -04:00
BTRFS_SETGET_HEADER_FUNCS ( header_generation , struct btrfs_header ,
generation , 64 ) ;
BTRFS_SETGET_HEADER_FUNCS ( header_owner , struct btrfs_header , owner , 64 ) ;
BTRFS_SETGET_HEADER_FUNCS ( header_nritems , struct btrfs_header , nritems , 32 ) ;
2008-04-01 11:21:32 -04:00
BTRFS_SETGET_HEADER_FUNCS ( header_flags , struct btrfs_header , flags , 64 ) ;
2007-10-15 16:14:19 -04:00
BTRFS_SETGET_HEADER_FUNCS ( header_level , struct btrfs_header , level , 8 ) ;
2013-07-16 11:19:18 +08:00
BTRFS_SETGET_STACK_FUNCS ( stack_header_generation , struct btrfs_header ,
generation , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_header_owner , struct btrfs_header , owner , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_header_nritems , struct btrfs_header ,
nritems , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_header_bytenr , struct btrfs_header , bytenr , 64 ) ;
2007-04-09 10:42:37 -04:00
2017-06-28 21:56:53 -06:00
static inline int btrfs_header_flag ( const struct extent_buffer * eb , u64 flag )
2008-04-01 11:21:32 -04:00
{
return ( btrfs_header_flags ( eb ) & flag ) == flag ;
}
2019-03-19 14:04:17 +08:00
static inline void btrfs_set_header_flag ( struct extent_buffer * eb , u64 flag )
2008-04-01 11:21:32 -04:00
{
u64 flags = btrfs_header_flags ( eb ) ;
btrfs_set_header_flags ( eb , flags | flag ) ;
}
2019-03-19 14:04:17 +08:00
static inline void btrfs_clear_header_flag ( struct extent_buffer * eb , u64 flag )
2008-04-01 11:21:32 -04:00
{
u64 flags = btrfs_header_flags ( eb ) ;
btrfs_set_header_flags ( eb , flags & ~ flag ) ;
}
2017-06-28 21:56:53 -06:00
static inline int btrfs_header_backref_rev ( const struct extent_buffer * eb )
2009-06-10 10:45:14 -04:00
{
u64 flags = btrfs_header_flags ( eb ) ;
return flags >> BTRFS_BACKREF_REV_SHIFT ;
}
static inline void btrfs_set_header_backref_rev ( struct extent_buffer * eb ,
int rev )
{
u64 flags = btrfs_header_flags ( eb ) ;
flags & = ~ BTRFS_BACKREF_REV_MASK ;
flags | = ( u64 ) rev < < BTRFS_BACKREF_REV_SHIFT ;
btrfs_set_header_flags ( eb , flags ) ;
}
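The two helpers above keep a back ref revision in the high bits of the header flags word. A minimal standalone sketch of that packing follows; the `demo_` names and the shift and mask values are assumptions chosen for illustration, not taken from this header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only; the real constants are defined elsewhere. */
#define DEMO_BACKREF_REV_SHIFT	56
#define DEMO_BACKREF_REV_MASK	(0xffULL << DEMO_BACKREF_REV_SHIFT)

/* Extract the revision stored in the top byte of the flags word. */
static inline uint64_t demo_backref_rev(uint64_t flags)
{
	return flags >> DEMO_BACKREF_REV_SHIFT;
}

/* Replace the revision while preserving the low 56 flag bits. */
static inline uint64_t demo_set_backref_rev(uint64_t flags, int rev)
{
	assert(rev >= 0 && rev <= 0xff);
	flags &= ~DEMO_BACKREF_REV_MASK;
	flags |= (uint64_t)rev << DEMO_BACKREF_REV_SHIFT;
	return flags;
}
```

Clearing the masked bits before OR-ing in the new value is what lets the setter overwrite an older revision instead of merging with it.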
2017-06-28 21:56:53 -06:00
static inline int btrfs_is_leaf(const struct extent_buffer *eb)
{
	return btrfs_header_level(eb) == 0;
}
2007-10-15 16:14:19 -04:00
/* struct btrfs_root_item */
2008-10-29 14:49:05 -04:00
BTRFS_SETGET_FUNCS ( disk_root_generation , struct btrfs_root_item ,
generation , 64 ) ;
2007-10-15 16:14:19 -04:00
BTRFS_SETGET_FUNCS ( disk_root_refs , struct btrfs_root_item , refs , 32 ) ;
2007-10-15 16:15:53 -04:00
BTRFS_SETGET_FUNCS ( disk_root_bytenr , struct btrfs_root_item , bytenr , 64 ) ;
BTRFS_SETGET_FUNCS ( disk_root_level , struct btrfs_root_item , level , 8 ) ;
2007-03-13 16:47:54 -04:00
2008-10-29 14:49:05 -04:00
BTRFS_SETGET_STACK_FUNCS ( root_generation , struct btrfs_root_item ,
generation , 64 ) ;
2007-10-15 16:15:53 -04:00
BTRFS_SETGET_STACK_FUNCS ( root_bytenr , struct btrfs_root_item , bytenr , 64 ) ;
2020-09-15 21:44:52 +02:00
BTRFS_SETGET_STACK_FUNCS ( root_drop_level , struct btrfs_root_item , drop_level , 8 ) ;
2007-10-15 16:15:53 -04:00
BTRFS_SETGET_STACK_FUNCS ( root_level , struct btrfs_root_item , level , 8 ) ;
2007-10-15 16:14:19 -04:00
BTRFS_SETGET_STACK_FUNCS ( root_dirid , struct btrfs_root_item , root_dirid , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( root_refs , struct btrfs_root_item , refs , 32 ) ;
2008-12-02 06:36:08 -05:00
BTRFS_SETGET_STACK_FUNCS ( root_flags , struct btrfs_root_item , flags , 64 ) ;
2007-10-15 16:15:53 -04:00
BTRFS_SETGET_STACK_FUNCS ( root_used , struct btrfs_root_item , bytes_used , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( root_limit , struct btrfs_root_item , byte_limit , 64 ) ;
2008-10-30 14:20:02 -04:00
BTRFS_SETGET_STACK_FUNCS ( root_last_snapshot , struct btrfs_root_item ,
last_snapshot , 64 ) ;
2012-07-25 17:35:53 +02:00
BTRFS_SETGET_STACK_FUNCS ( root_generation_v2 , struct btrfs_root_item ,
generation_v2 , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( root_ctransid , struct btrfs_root_item ,
ctransid , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( root_otransid , struct btrfs_root_item ,
otransid , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( root_stransid , struct btrfs_root_item ,
stransid , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( root_rtransid , struct btrfs_root_item ,
rtransid , 64 ) ;
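Each `BTRFS_SETGET_STACK_FUNCS` line above appears to generate a getter and a setter for one little-endian member of an in-memory copy of a structure. The sketch below shows one way such a macro could be written for 64-bit members; `DEMO_SETGET_STACK_FUNCS64`, `struct demo_root_item` and the byte-wise helpers are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Portable little-endian load; a stand-in for le64_to_cpu(). */
static inline uint64_t demo_load_le64(const uint8_t *p)
{
	uint64_t v = 0;

	assert(p != 0);
	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Portable little-endian store; a stand-in for cpu_to_le64(). */
static inline void demo_store_le64(uint8_t *p, uint64_t v)
{
	for (int i = 0; i < 8; i++) {
		p[i] = (uint8_t)(v & 0xff);
		v >>= 8;
	}
}

/* Hypothetical on-disk structure: members are raw little-endian bytes. */
struct demo_root_item {
	uint8_t generation[8];
	uint8_t bytenr[8];
};

/* Generate demo_<name>() and demo_set_<name>() for one 64-bit member. */
#define DEMO_SETGET_STACK_FUNCS64(name, type, member)			\
static inline uint64_t demo_##name(const type *s)			\
{									\
	return demo_load_le64(s->member);				\
}									\
static inline void demo_set_##name(type *s, uint64_t val)		\
{									\
	demo_store_le64(s->member, val);				\
}

DEMO_SETGET_STACK_FUNCS64(root_generation, struct demo_root_item, generation)
DEMO_SETGET_STACK_FUNCS64(root_bytenr, struct demo_root_item, bytenr)
```

Generating the accessors from a macro keeps the byte-order conversion in one place, so callers never touch the raw on-disk representation directly.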
2007-03-14 14:14:43 -04:00
2017-06-28 21:56:53 -06:00
static inline bool btrfs_root_readonly(const struct btrfs_root *root)
{
	return (root->root_item.flags & cpu_to_le64(BTRFS_ROOT_SUBVOL_RDONLY)) != 0;
}
2017-06-28 21:56:53 -06:00
static inline bool btrfs_root_dead(const struct btrfs_root *root)
{
	return (root->root_item.flags & cpu_to_le64(BTRFS_ROOT_SUBVOL_DEAD)) != 0;
}
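Both predicates above test a bit in a field that stays in its little-endian on-disk form: the constant is converted, not the field. A small sketch of that idea follows, assuming a hypothetical `DEMO_SUBVOL_RDONLY` bit and a byte-wise `demo_cpu_to_le64()` in place of the kernel helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SUBVOL_RDONLY	(1ULL << 0)	/* assumed bit position */

/* Produce the little-endian in-memory representation of a host-order value. */
static inline uint64_t demo_cpu_to_le64(uint64_t v)
{
	uint8_t b[8];
	uint64_t out;

	for (int i = 0; i < 8; i++)
		b[i] = (uint8_t)(v >> (8 * i));
	memcpy(&out, b, sizeof(out));
	return out;
}

/* flags_le is the raw little-endian value as stored on disk. */
static inline bool demo_root_readonly(uint64_t flags_le)
{
	assert(DEMO_SUBVOL_RDONLY != 0);
	return (flags_le & demo_cpu_to_le64(DEMO_SUBVOL_RDONLY)) != 0;
}
```

Because bitwise AND works byte by byte, masking two little-endian values gives the same truth value as masking their host-order equivalents, and a compiler can fold the conversion of the constant at build time.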
2011-11-03 15:17:42 -04:00
/* struct btrfs_root_backup */
BTRFS_SETGET_STACK_FUNCS ( backup_tree_root , struct btrfs_root_backup ,
tree_root , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_tree_root_gen , struct btrfs_root_backup ,
tree_root_gen , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_tree_root_level , struct btrfs_root_backup ,
tree_root_level , 8 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_chunk_root , struct btrfs_root_backup ,
chunk_root , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_chunk_root_gen , struct btrfs_root_backup ,
chunk_root_gen , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_chunk_root_level , struct btrfs_root_backup ,
chunk_root_level , 8 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_extent_root , struct btrfs_root_backup ,
extent_root , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_extent_root_gen , struct btrfs_root_backup ,
extent_root_gen , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_extent_root_level , struct btrfs_root_backup ,
extent_root_level , 8 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_fs_root , struct btrfs_root_backup ,
fs_root , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_fs_root_gen , struct btrfs_root_backup ,
fs_root_gen , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_fs_root_level , struct btrfs_root_backup ,
fs_root_level , 8 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_dev_root , struct btrfs_root_backup ,
dev_root , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_dev_root_gen , struct btrfs_root_backup ,
dev_root_gen , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_dev_root_level , struct btrfs_root_backup ,
dev_root_level , 8 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_csum_root , struct btrfs_root_backup ,
csum_root , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_csum_root_gen , struct btrfs_root_backup ,
csum_root_gen , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_csum_root_level , struct btrfs_root_backup ,
csum_root_level , 8 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_total_bytes , struct btrfs_root_backup ,
total_bytes , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_bytes_used , struct btrfs_root_backup ,
bytes_used , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( backup_num_devices , struct btrfs_root_backup ,
num_devices , 64 ) ;
2012-01-16 22:04:48 +02:00
/* struct btrfs_balance_item */
BTRFS_SETGET_FUNCS ( balance_flags , struct btrfs_balance_item , flags , 64 ) ;
2008-12-02 07:17:45 -05:00
2017-06-28 21:56:53 -06:00
static inline void btrfs_balance_data(const struct extent_buffer *eb,
				      const struct btrfs_balance_item *bi,
				      struct btrfs_disk_balance_args *ba)
{
	read_eb_member(eb, bi, struct btrfs_balance_item, data, ba);
}

static inline void btrfs_set_balance_data(struct extent_buffer *eb,
					  struct btrfs_balance_item *bi,
					  const struct btrfs_disk_balance_args *ba)
{
	write_eb_member(eb, bi, struct btrfs_balance_item, data, ba);
}

static inline void btrfs_balance_meta(const struct extent_buffer *eb,
				      const struct btrfs_balance_item *bi,
				      struct btrfs_disk_balance_args *ba)
{
	read_eb_member(eb, bi, struct btrfs_balance_item, meta, ba);
}

static inline void btrfs_set_balance_meta(struct extent_buffer *eb,
					  struct btrfs_balance_item *bi,
					  const struct btrfs_disk_balance_args *ba)
{
	write_eb_member(eb, bi, struct btrfs_balance_item, meta, ba);
}

static inline void btrfs_balance_sys(const struct extent_buffer *eb,
				     const struct btrfs_balance_item *bi,
				     struct btrfs_disk_balance_args *ba)
{
	read_eb_member(eb, bi, struct btrfs_balance_item, sys, ba);
}

static inline void btrfs_set_balance_sys(struct extent_buffer *eb,
					 struct btrfs_balance_item *bi,
					 const struct btrfs_disk_balance_args *ba)
{
	write_eb_member(eb, bi, struct btrfs_balance_item, sys, ba);
}
static inline void
btrfs_disk_balance_args_to_cpu(struct btrfs_balance_args *cpu,
			       const struct btrfs_disk_balance_args *disk)
{
	memset(cpu, 0, sizeof(*cpu));

	cpu->profiles = le64_to_cpu(disk->profiles);
	cpu->usage = le64_to_cpu(disk->usage);
	cpu->devid = le64_to_cpu(disk->devid);
	cpu->pstart = le64_to_cpu(disk->pstart);
	cpu->pend = le64_to_cpu(disk->pend);
	cpu->vstart = le64_to_cpu(disk->vstart);
	cpu->vend = le64_to_cpu(disk->vend);
	cpu->target = le64_to_cpu(disk->target);
	cpu->flags = le64_to_cpu(disk->flags);
	cpu->limit = le64_to_cpu(disk->limit);
	cpu->stripes_min = le32_to_cpu(disk->stripes_min);
	cpu->stripes_max = le32_to_cpu(disk->stripes_max);
}
static inline void
btrfs_cpu_balance_args_to_disk(struct btrfs_disk_balance_args *disk,
			       const struct btrfs_balance_args *cpu)
{
	memset(disk, 0, sizeof(*disk));

	disk->profiles = cpu_to_le64(cpu->profiles);
	disk->usage = cpu_to_le64(cpu->usage);
	disk->devid = cpu_to_le64(cpu->devid);
	disk->pstart = cpu_to_le64(cpu->pstart);
	disk->pend = cpu_to_le64(cpu->pend);
	disk->vstart = cpu_to_le64(cpu->vstart);
	disk->vend = cpu_to_le64(cpu->vend);
	disk->target = cpu_to_le64(cpu->target);
	disk->flags = cpu_to_le64(cpu->flags);
	disk->limit = cpu_to_le64(cpu->limit);
	disk->stripes_min = cpu_to_le32(cpu->stripes_min);
	disk->stripes_max = cpu_to_le32(cpu->stripes_max);
}
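The two conversion helpers above follow one pattern: zero the destination, then copy each member through a byte-order conversion of the matching width. The sketch below reduces that to a two-member structure; the `demo_` types and helpers are hypothetical and only mirror the shape of the real code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-disk form: raw little-endian bytes. */
struct demo_disk_args {
	uint8_t usage[8];
	uint8_t stripes_min[4];
};

/* Hypothetical in-memory form: host-order integers. */
struct demo_cpu_args {
	uint64_t usage;
	uint32_t stripes_min;
};

/* Read an n-byte little-endian integer (n <= 8). */
static uint64_t demo_get_le(const uint8_t *p, int n)
{
	uint64_t v = 0;

	assert(n >= 1 && n <= 8);
	for (int i = n - 1; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Write the low n bytes of v in little-endian order (n <= 8). */
static void demo_put_le(uint8_t *p, uint64_t v, int n)
{
	assert(n >= 1 && n <= 8);
	for (int i = 0; i < n; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

static void demo_disk_to_cpu(struct demo_cpu_args *cpu,
			     const struct demo_disk_args *disk)
{
	/* Zero first so members without an on-disk counterpart stay defined. */
	memset(cpu, 0, sizeof(*cpu));
	cpu->usage = demo_get_le(disk->usage, 8);
	cpu->stripes_min = (uint32_t)demo_get_le(disk->stripes_min, 4);
}

static void demo_cpu_to_disk(struct demo_disk_args *disk,
			     const struct demo_cpu_args *cpu)
{
	memset(disk, 0, sizeof(*disk));
	demo_put_le(disk->usage, cpu->usage, 8);
	demo_put_le(disk->stripes_min, cpu->stripes_min, 4);
}
```

A round trip through both functions should return the original values on either byte order, which is the property the 32-bit and 64-bit conversions in the real helpers rely on.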
/* struct btrfs_super_block */
2007-10-15 16:15:53 -04:00
BTRFS_SETGET_STACK_FUNCS ( super_bytenr , struct btrfs_super_block , bytenr , 64 ) ;
2008-05-07 11:43:44 -04:00
BTRFS_SETGET_STACK_FUNCS ( super_flags , struct btrfs_super_block , flags , 64 ) ;
2007-10-15 16:14:19 -04:00
BTRFS_SETGET_STACK_FUNCS ( super_generation , struct btrfs_super_block ,
generation , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( super_root , struct btrfs_super_block , root , 64 ) ;
2008-03-24 15:01:56 -04:00
BTRFS_SETGET_STACK_FUNCS ( super_sys_array_size ,
struct btrfs_super_block , sys_chunk_array_size , 32 ) ;
2008-10-29 14:49:05 -04:00
BTRFS_SETGET_STACK_FUNCS ( super_chunk_root_generation ,
struct btrfs_super_block , chunk_root_generation , 64 ) ;
2007-10-15 16:15:53 -04:00
BTRFS_SETGET_STACK_FUNCS ( super_root_level , struct btrfs_super_block ,
root_level , 8 ) ;
2008-03-24 15:01:56 -04:00
BTRFS_SETGET_STACK_FUNCS ( super_chunk_root , struct btrfs_super_block ,
chunk_root , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( super_chunk_root_level , struct btrfs_super_block ,
chunk_root_level , 8 ) ;
BTRFS_SETGET_STACK_FUNCS ( super_log_root , struct btrfs_super_block ,
log_root , 64 ) ;
2008-12-08 16:40:21 -05:00
BTRFS_SETGET_STACK_FUNCS ( super_log_root_transid , struct btrfs_super_block ,
log_root_transid , 64 ) ;
2008-09-05 16:13:11 -04:00
BTRFS_SETGET_STACK_FUNCS ( super_log_root_level , struct btrfs_super_block ,
log_root_level , 8 ) ;
2007-10-15 16:15:53 -04:00
BTRFS_SETGET_STACK_FUNCS ( super_total_bytes , struct btrfs_super_block ,
total_bytes , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( super_bytes_used , struct btrfs_super_block ,
bytes_used , 64 ) ;
2007-10-15 16:14:19 -04:00
BTRFS_SETGET_STACK_FUNCS ( super_sectorsize , struct btrfs_super_block ,
sectorsize , 32 ) ;
BTRFS_SETGET_STACK_FUNCS ( super_nodesize , struct btrfs_super_block ,
nodesize , 32 ) ;
2007-11-30 11:30:34 -05:00
BTRFS_SETGET_STACK_FUNCS ( super_stripesize , struct btrfs_super_block ,
stripesize , 32 ) ;
2007-10-15 16:14:19 -04:00
BTRFS_SETGET_STACK_FUNCS ( super_root_dir , struct btrfs_super_block ,
root_dir_objectid , 64 ) ;
2008-03-24 15:02:07 -04:00
BTRFS_SETGET_STACK_FUNCS ( super_num_devices , struct btrfs_super_block ,
num_devices , 64 ) ;
2008-12-02 06:36:08 -05:00
BTRFS_SETGET_STACK_FUNCS ( super_compat_flags , struct btrfs_super_block ,
compat_flags , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( super_compat_ro_flags , struct btrfs_super_block ,
compat_ro_flags , 64 ) ;
2008-12-02 06:36:08 -05:00
BTRFS_SETGET_STACK_FUNCS ( super_incompat_flags , struct btrfs_super_block ,
incompat_flags , 64 ) ;
2008-12-02 07:17:45 -05:00
BTRFS_SETGET_STACK_FUNCS ( super_csum_type , struct btrfs_super_block ,
csum_type , 16 ) ;
2010-06-21 14:48:16 -04:00
BTRFS_SETGET_STACK_FUNCS ( super_cache_generation , struct btrfs_super_block ,
cache_generation , 64 ) ;
2013-07-16 11:19:18 +08:00
BTRFS_SETGET_STACK_FUNCS ( super_magic , struct btrfs_super_block , magic , 64 ) ;
2013-08-15 17:11:22 +02:00
BTRFS_SETGET_STACK_FUNCS ( super_uuid_tree_generation , struct btrfs_super_block ,
uuid_tree_generation , 64 ) ;
2008-12-02 07:17:45 -05:00
2019-08-30 13:36:09 +02:00
int btrfs_super_csum_size ( const struct btrfs_super_block * s ) ;
const char * btrfs_super_csum_name ( u16 csum_type ) ;
2019-10-08 18:41:33 +02:00
const char * btrfs_super_csum_driver ( u16 csum_type ) ;
2020-07-27 17:38:19 +02:00
size_t __attribute_const__ btrfs_get_num_csums ( void ) ;
2019-10-07 11:11:03 +02:00
2007-03-21 11:12:56 -04:00
2016-09-23 13:44:44 -07:00
/*
 * The leaf data grows from end to front in the node.
 * This returns the offset of the start of the last item,
 * which is the end of the leaf data stack.
 */
static inline unsigned int leaf_data_end(const struct extent_buffer *leaf)
{
	u32 nr = btrfs_header_nritems(leaf);

	if (nr == 0)
		return BTRFS_LEAF_DATA_SIZE(leaf->fs_info);
	return btrfs_item_offset_nr(leaf, nr - 1);
}
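The comment on `leaf_data_end()` describes a data area that fills from the back, so the last item's offset marks the lowest used byte. The toy model below sketches that behaviour under simplifying assumptions: a fixed `DEMO_LEAF_DATA_SIZE`, a separate fixed-size item array, and no accounting for the space item headers take in a real leaf.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_LEAF_DATA_SIZE	4096u	/* assumed size of the data area */
#define DEMO_MAX_ITEMS		16u	/* assumed capacity of the toy leaf */

struct demo_item {
	uint32_t offset;	/* start of this item's data in the data area */
	uint32_t size;
};

struct demo_leaf {
	uint32_t nritems;
	struct demo_item items[DEMO_MAX_ITEMS];
};

/* Offset of the lowest used byte; the full size when the leaf is empty. */
static inline uint32_t demo_leaf_data_end(const struct demo_leaf *leaf)
{
	if (leaf->nritems == 0)
		return DEMO_LEAF_DATA_SIZE;
	return leaf->items[leaf->nritems - 1].offset;
}

/* Append an item; returns its data offset, or -1 if it does not fit. */
static inline int demo_leaf_push(struct demo_leaf *leaf, uint32_t size)
{
	uint32_t end = demo_leaf_data_end(leaf);

	assert(leaf->nritems <= DEMO_MAX_ITEMS);
	if (leaf->nritems == DEMO_MAX_ITEMS || size > end)
		return -1;
	leaf->items[leaf->nritems].offset = end - size;
	leaf->items[leaf->nritems].size = size;
	leaf->nritems++;
	return (int)(end - size);
}
```

Each new item is placed immediately below the previous one, which is why only the last item needs to be inspected to find the free boundary.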
2007-10-15 16:14:19 -04:00
/* struct btrfs_file_extent_item */
btrfs: inode: refactor the parameters of insert_reserved_file_extent()
Function insert_reserved_file_extent() takes a long list of parameters,
which are all for btrfs_file_extent_item, even including two reserved
members, encryption and other_encoding.
This makes the parameter list unnecessarily long for a function which only
gets called twice.
This patch will refactor the parameter list, by using
btrfs_file_extent_item as parameter directly to hugely reduce the number
of parameters.
Also, since there are only two callers, one in btrfs_finish_ordered_io()
which inserts file extent for ordered extent, and one
__btrfs_prealloc_file_range().
These two call sites have completely different context, where ordered
extent can be compressed, but will always be regular extent, while the
preallocated one is never going to be compressed and always has PREALLOC
type.
So use two small wrappers for these two different call sites to improve
readability.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-10 09:04:40 +08:00
BTRFS_SETGET_STACK_FUNCS ( stack_file_extent_type , struct btrfs_file_extent_item ,
type , 8 ) ;
2013-07-16 11:19:18 +08:00
BTRFS_SETGET_STACK_FUNCS ( stack_file_extent_disk_bytenr ,
struct btrfs_file_extent_item , disk_bytenr , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_file_extent_offset ,
struct btrfs_file_extent_item , offset , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_file_extent_generation ,
struct btrfs_file_extent_item , generation , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_file_extent_num_bytes ,
struct btrfs_file_extent_item , num_bytes , 64 ) ;
2020-06-10 09:04:40 +08:00
BTRFS_SETGET_STACK_FUNCS ( stack_file_extent_ram_bytes ,
struct btrfs_file_extent_item , ram_bytes , 64 ) ;
2013-11-13 21:11:49 -05:00
BTRFS_SETGET_STACK_FUNCS ( stack_file_extent_disk_num_bytes ,
struct btrfs_file_extent_item , disk_num_bytes , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_file_extent_compression ,
struct btrfs_file_extent_item , compression , 8 ) ;
2007-03-20 14:38:32 -04:00
2009-01-05 21:25:51 -05:00
static inline unsigned long
btrfs_file_extent_inline_start(const struct btrfs_file_extent_item *e)
{
	return (unsigned long)e + BTRFS_FILE_EXTENT_INLINE_DATA_START;
}

static inline u32 btrfs_file_extent_calc_inline_size(u32 datasize)
{
	return BTRFS_FILE_EXTENT_INLINE_DATA_START + datasize;
}
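The inline helpers above treat the inline payload as starting at a fixed offset inside the file extent item, so item size and payload size differ by that constant. The sketch below illustrates this with `offsetof`; `struct demo_file_extent_item` is a simplified, hypothetical layout and `DEMO_INLINE_DATA_START` is not the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified header; byte arrays avoid any padding. */
struct demo_file_extent_item {
	uint8_t generation[8];
	uint8_t ram_bytes[8];
	uint8_t compression;
	uint8_t encryption;
	uint8_t other_encoding[2];
	uint8_t type;
	uint8_t inline_data[];	/* payload begins here for inline extents */
};

#define DEMO_INLINE_DATA_START \
	((uint32_t)offsetof(struct demo_file_extent_item, inline_data))

/* Total item size needed to hold datasize bytes of inline payload. */
static inline uint32_t demo_calc_inline_size(uint32_t datasize)
{
	return DEMO_INLINE_DATA_START + datasize;
}

/* Payload length recovered from the total item size. */
static inline uint32_t demo_inline_len(uint32_t item_size)
{
	assert(item_size >= DEMO_INLINE_DATA_START);
	return item_size - DEMO_INLINE_DATA_START;
}
```

The two functions are inverses of each other, which matches how the item length helper later in this header subtracts the same constant from the item size.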
2020-06-10 09:04:40 +08:00
BTRFS_SETGET_FUNCS ( file_extent_type , struct btrfs_file_extent_item , type , 8 ) ;
2007-10-15 16:15:53 -04:00
BTRFS_SETGET_FUNCS ( file_extent_disk_bytenr , struct btrfs_file_extent_item ,
disk_bytenr , 64 ) ;
2007-10-15 16:14:19 -04:00
BTRFS_SETGET_FUNCS ( file_extent_generation , struct btrfs_file_extent_item ,
generation , 64 ) ;
2007-10-15 16:15:53 -04:00
BTRFS_SETGET_FUNCS ( file_extent_disk_num_bytes , struct btrfs_file_extent_item ,
disk_num_bytes , 64 ) ;
2007-10-15 16:14:19 -04:00
BTRFS_SETGET_FUNCS ( file_extent_offset , struct btrfs_file_extent_item ,
offset , 64 ) ;
2007-10-15 16:15:53 -04:00
BTRFS_SETGET_FUNCS ( file_extent_num_bytes , struct btrfs_file_extent_item ,
num_bytes , 64 ) ;
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 14:49:59 -04:00
BTRFS_SETGET_FUNCS ( file_extent_ram_bytes , struct btrfs_file_extent_item ,
ram_bytes , 64 ) ;
BTRFS_SETGET_FUNCS ( file_extent_compression , struct btrfs_file_extent_item ,
compression , 8 ) ;
BTRFS_SETGET_FUNCS ( file_extent_encryption , struct btrfs_file_extent_item ,
encryption , 8 ) ;
BTRFS_SETGET_FUNCS ( file_extent_other_encoding , struct btrfs_file_extent_item ,
other_encoding , 16 ) ;
/*
 * this returns the number of bytes used by the item on disk, minus the
 * size of any extent headers.  If a file is compressed on disk, this is
 * the compressed size
 */
2017-06-28 21:56:53 -06:00
static inline u32 btrfs_file_extent_inline_item_len (
const struct extent_buffer * eb ,
struct btrfs_item * e )
{
return btrfs_item_size ( eb , e ) - BTRFS_FILE_EXTENT_INLINE_DATA_START ;
}
2007-03-20 14:38:32 -04:00
2011-09-13 11:06:07 +02:00
/* btrfs_qgroup_status_item */
BTRFS_SETGET_FUNCS ( qgroup_status_generation , struct btrfs_qgroup_status_item ,
generation , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_status_version , struct btrfs_qgroup_status_item ,
version , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_status_flags , struct btrfs_qgroup_status_item ,
flags , 64 ) ;
2013-04-25 16:04:51 +00:00
BTRFS_SETGET_FUNCS ( qgroup_status_rescan , struct btrfs_qgroup_status_item ,
rescan , 64 ) ;
2011-09-13 11:06:07 +02:00
/* btrfs_qgroup_info_item */
BTRFS_SETGET_FUNCS ( qgroup_info_generation , struct btrfs_qgroup_info_item ,
generation , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_info_rfer , struct btrfs_qgroup_info_item , rfer , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_info_rfer_cmpr , struct btrfs_qgroup_info_item ,
rfer_cmpr , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_info_excl , struct btrfs_qgroup_info_item , excl , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_info_excl_cmpr , struct btrfs_qgroup_info_item ,
excl_cmpr , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_qgroup_info_generation ,
struct btrfs_qgroup_info_item , generation , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_qgroup_info_rfer , struct btrfs_qgroup_info_item ,
rfer , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_qgroup_info_rfer_cmpr ,
struct btrfs_qgroup_info_item , rfer_cmpr , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_qgroup_info_excl , struct btrfs_qgroup_info_item ,
excl , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_qgroup_info_excl_cmpr ,
struct btrfs_qgroup_info_item , excl_cmpr , 64 ) ;
/* btrfs_qgroup_limit_item */
BTRFS_SETGET_FUNCS ( qgroup_limit_flags , struct btrfs_qgroup_limit_item ,
flags , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_limit_max_rfer , struct btrfs_qgroup_limit_item ,
max_rfer , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_limit_max_excl , struct btrfs_qgroup_limit_item ,
max_excl , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_limit_rsv_rfer , struct btrfs_qgroup_limit_item ,
rsv_rfer , 64 ) ;
BTRFS_SETGET_FUNCS ( qgroup_limit_rsv_excl , struct btrfs_qgroup_limit_item ,
rsv_excl , 64 ) ;
2012-11-05 17:32:20 +01:00
/* btrfs_dev_replace_item */
BTRFS_SETGET_FUNCS ( dev_replace_src_devid ,
struct btrfs_dev_replace_item , src_devid , 64 ) ;
BTRFS_SETGET_FUNCS ( dev_replace_cont_reading_from_srcdev_mode ,
struct btrfs_dev_replace_item , cont_reading_from_srcdev_mode ,
64 ) ;
BTRFS_SETGET_FUNCS ( dev_replace_replace_state , struct btrfs_dev_replace_item ,
replace_state , 64 ) ;
BTRFS_SETGET_FUNCS ( dev_replace_time_started , struct btrfs_dev_replace_item ,
time_started , 64 ) ;
BTRFS_SETGET_FUNCS ( dev_replace_time_stopped , struct btrfs_dev_replace_item ,
time_stopped , 64 ) ;
BTRFS_SETGET_FUNCS ( dev_replace_num_write_errors , struct btrfs_dev_replace_item ,
num_write_errors , 64 ) ;
BTRFS_SETGET_FUNCS ( dev_replace_num_uncorrectable_read_errors ,
struct btrfs_dev_replace_item , num_uncorrectable_read_errors ,
64 ) ;
BTRFS_SETGET_FUNCS ( dev_replace_cursor_left , struct btrfs_dev_replace_item ,
cursor_left , 64 ) ;
BTRFS_SETGET_FUNCS ( dev_replace_cursor_right , struct btrfs_dev_replace_item ,
cursor_right , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dev_replace_src_devid ,
struct btrfs_dev_replace_item , src_devid , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dev_replace_cont_reading_from_srcdev_mode ,
struct btrfs_dev_replace_item ,
cont_reading_from_srcdev_mode , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dev_replace_replace_state ,
struct btrfs_dev_replace_item , replace_state , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dev_replace_time_started ,
struct btrfs_dev_replace_item , time_started , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dev_replace_time_stopped ,
struct btrfs_dev_replace_item , time_stopped , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dev_replace_num_write_errors ,
struct btrfs_dev_replace_item , num_write_errors , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dev_replace_num_uncorrectable_read_errors ,
struct btrfs_dev_replace_item ,
num_uncorrectable_read_errors , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dev_replace_cursor_left ,
struct btrfs_dev_replace_item , cursor_left , 64 ) ;
BTRFS_SETGET_STACK_FUNCS ( stack_dev_replace_cursor_right ,
struct btrfs_dev_replace_item , cursor_right , 64 ) ;
2007-03-14 10:31:29 -04:00
/* helper function to cast into the data area of the leaf. */
#define btrfs_item_ptr(leaf, slot, type) \
	((type *)(BTRFS_LEAF_DATA_OFFSET + \
	btrfs_item_offset_nr(leaf, slot)))

#define btrfs_item_ptr_offset(leaf, slot) \
	((unsigned long)(BTRFS_LEAF_DATA_OFFSET + \
	btrfs_item_offset_nr(leaf, slot)))
2007-03-14 10:31:29 -04:00
2019-05-22 10:18:59 +02:00
static inline u32 btrfs_crc32c ( u32 crc , const void * address , unsigned length )
{
return crc32c ( crc , address , length ) ;
}
static inline void btrfs_crc32c_final ( u32 crc , u8 * result )
{
put_unaligned_le32 ( ~ crc , result ) ;
}
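The pair above splits checksumming into a running update and a final step that inverts the value and stores it little-endian. The sketch below shows that split with a plain bitwise CRC-32C; `demo_crc32c()` is an assumed stand-in for the kernel's `crc32c()` (taken here to perform only the raw update, with no inversion of its own) and is far slower than a table-driven or hardware version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli) update with no pre- or post-inversion. */
static uint32_t demo_crc32c(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;

	assert(p != NULL || len == 0);
	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Invert the running CRC and store it as four little-endian bytes. */
static void demo_crc32c_final(uint32_t crc, uint8_t *result)
{
	crc = ~crc;
	for (int i = 0; i < 4; i++)
		result[i] = (uint8_t)(crc >> (8 * i));
}
```

Seeding the update with all ones and inverting at the end yields the conventional CRC-32C value, while keeping the update itself free of inversion lets a checksum be accumulated across several buffers.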
btrfs: Remove custom crc32c init code
The custom crc32 init code was introduced in
14a958e678cd ("Btrfs: fix btrfs boot when compiled as built-in") to
enable using btrfs as a built-in. However, later as pointed out by
60efa5eb2e88 ("Btrfs: use late_initcall instead of module_init") this
wasn't enough and finally btrfs was switched to late_initcall which
comes after the generic crc32c implementation is initialised. The
latter commit superseded the former. Now that we don't have to
maintain our own code let's just remove it and switch to using the
generic implementation.
Despite touching a lot of files the patch is really simple. Here is the gist of
the changes:
1. Select LIBCRC32C rather than the low-level modules.
2. s/btrfs_crc32c/crc32c/g
3. replace hash.h with linux/crc32c.h
4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly.
I've tested this with btrfs being both a module and a built-in and xfstest
doesn't complain.
Does seem to fix the longstanding problem of not automatically selecting
the crc32c module when btrfs is used. Possibly there is a workaround in
dracut.
The modinfo confirms that now all the module dependencies are there:
before:
depends: zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
after:
depends: libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add more info to changelog from mails ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-08 11:45:05 +02:00
static inline u64 btrfs_name_hash(const char *name, int len)
{
	return crc32c((u32)~1, name, len);
}

/*
 * Figure the key offset of an extended inode ref
 */
static inline u64 btrfs_extref_hash(u64 parent_objectid, const char *name,
				    int len)
{
	return (u64)crc32c(parent_objectid, name, len);
}
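The CRC helpers above can be tried out in user space. What follows is a minimal sketch, not the kernel implementation: `crc32c_sw`, `crc32c_final_sw`, `name_hash_sw` and `extref_hash_sw` are hypothetical names, and the bitwise loop stands in for the kernel's `crc32c()`, on the assumption that `crc32c()` applies no inversion of its own to the seed or the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
 * Assumption: like the kernel's crc32c(), no initial or final inversion
 * is applied here; the caller supplies the seed.
 */
static uint32_t crc32c_sw(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Mirrors btrfs_crc32c_final(): invert, then store little-endian. */
static void crc32c_final_sw(uint32_t crc, uint8_t *result)
{
	uint32_t v = ~crc;

	result[0] = (uint8_t)(v & 0xff);
	result[1] = (uint8_t)((v >> 8) & 0xff);
	result[2] = (uint8_t)((v >> 16) & 0xff);
	result[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Mirrors btrfs_name_hash(): the seed is (u32)~1, i.e. 0xFFFFFFFE. */
static uint64_t name_hash_sw(const char *name, int len)
{
	assert(len >= 0);
	return crc32c_sw((uint32_t)~1, name, (size_t)len);
}

/*
 * Mirrors btrfs_extref_hash(): the 64-bit parent objectid is narrowed
 * to 32 bits when it is passed as the CRC seed.
 */
static uint64_t extref_hash_sw(uint64_t parent_objectid, const char *name,
			       int len)
{
	assert(len >= 0);
	return (uint64_t)crc32c_sw((uint32_t)parent_objectid, name, (size_t)len);
}
```

Seeding `crc32c_sw()` with `~0u` and passing the result through `crc32c_final_sw()` gives the standard CRC-32C of the input, which makes the sketch easy to check against published test vectors. Because the seed is narrowed, two parent objectids that differ only in their upper 32 bits produce the same extref hash for the same name.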
2011-09-21 15:05:58 -04:00
static inline gfp_t btrfs_alloc_write_mask ( struct address_space * mapping )
{
2015-11-06 16:28:49 -08:00
return mapping_gfp_constraint ( mapping , ~ __GFP_FS ) ;
2011-09-21 15:05:58 -04:00
}
2007-04-17 13:26:50 -04:00
/* extent-tree.c */
2015-02-04 06:59:29 -08:00
2017-08-18 15:15:18 -06:00
enum btrfs_inline_ref_type {
2018-11-27 15:25:13 +01:00
BTRFS_REF_TYPE_INVALID ,
BTRFS_REF_TYPE_BLOCK ,
BTRFS_REF_TYPE_DATA ,
BTRFS_REF_TYPE_ANY ,
2017-08-18 15:15:18 -06:00
} ;
int btrfs_get_extent_inline_ref_type ( const struct extent_buffer * eb ,
struct btrfs_extent_inline_ref * iref ,
enum btrfs_inline_ref_type is_data ) ;
2019-08-09 09:24:24 +08:00
u64 hash_extent_data_ref ( u64 root_objectid , u64 owner , u64 offset ) ;
2017-08-18 15:15:18 -06:00
2020-07-02 10:54:11 +02:00
/*
 * Take the number of bytes to be checksummed and figure out how many leaves
 * it would require to store the csums for that many bytes.
*/
static inline u64 btrfs_csum_bytes_to_leaves(
			const struct btrfs_fs_info *fs_info, u64 csum_bytes)
{
	const u64 num_csums = csum_bytes >> fs_info->sectorsize_bits;

	return DIV_ROUND_UP_ULL(num_csums, fs_info->csums_per_leaf);
}
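The arithmetic in btrfs_csum_bytes_to_leaves() can be sketched outside the kernel as follows. `struct fs_info_sketch` is a hypothetical stand-in that carries only the two fields the helper reads, and the values used in the note below are illustrative assumptions, not real filesystem parameters.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two btrfs_fs_info fields used here. */
struct fs_info_sketch {
	uint32_t sectorsize_bits;	/* log2(sectorsize), e.g. 12 for 4 KiB */
	uint32_t csums_per_leaf;	/* checksums that fit in one leaf */
};

/* One checksum per sector, then round up to whole leaves. */
static uint64_t csum_bytes_to_leaves(const struct fs_info_sketch *fs,
				     uint64_t csum_bytes)
{
	assert(fs->csums_per_leaf > 0);

	const uint64_t num_csums = csum_bytes >> fs->sectorsize_bits;

	/* Same result as DIV_ROUND_UP_ULL(num_csums, csums_per_leaf). */
	return (num_csums + fs->csums_per_leaf - 1) / fs->csums_per_leaf;
}
```

With 4 KiB sectors and an assumed 1000 checksums per leaf, 16 MiB of data needs 4096 checksums and therefore 5 leaves, while 1 MiB (256 checksums) still occupies one leaf.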
2015-02-04 06:59:29 -08:00
2019-08-22 15:14:33 -04:00
/*
 * Use this if we would be adding new items, as we could split nodes as we cow
 * down the tree.
*/
static inline u64 btrfs_calc_insert_metadata_size ( struct btrfs_fs_info * fs_info ,
unsigned num_items )
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which is waiting to be dealt with
by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the delayed items is beyond the lower limit, we create works for some
delayed nodes and insert them into the work queue of the worker, and then
go back.
When the delayed items is beyond the upper bound, we create works for all
the delayed nodes that haven't been dealt with, and insert them into the work
queue of the worker, and then wait for that the untreated items is below some
threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
in the inserting rb-tree at first. If we look it up, just drop it. If not,
add the key of it into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
inode into the delayed node. the worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node, By this way, we can cache more delayed items and merge more
inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
{
Btrfs: fix delalloc accounting leak caused by u32 overflow
btrfs_calc_trans_metadata_size() does an unsigned 32-bit multiplication,
which can overflow if num_items >= 4 GB / (nodesize * BTRFS_MAX_LEVEL * 2).
For a nodesize of 16kB, this overflow happens at 16k items. Usually,
num_items is a small constant passed to btrfs_start_transaction(), but
we also use btrfs_calc_trans_metadata_size() for metadata reservations
for extent items in btrfs_delalloc_{reserve,release}_metadata().
In drop_outstanding_extents(), num_items is calculated as
inode->reserved_extents - inode->outstanding_extents. The difference
between these two counters is usually small, but if many delalloc
extents are reserved and then the outstanding extents are merged in
btrfs_merge_extent_hook(), the difference can become large enough to
overflow in btrfs_calc_trans_metadata_size().
The overflow manifests itself as a leak of a multiple of 4 GB in
delalloc_block_rsv and the metadata bytes_may_use counter. This in turn
can cause early ENOSPC errors. Additionally, these WARN_ONs in
extent-tree.c will be hit when unmounting:
WARN_ON(fs_info->delalloc_block_rsv.size > 0);
WARN_ON(fs_info->delalloc_block_rsv.reserved > 0);
WARN_ON(space_info->bytes_pinned > 0 ||
space_info->bytes_reserved > 0 ||
space_info->bytes_may_use > 0);
Fix it by casting nodesize to a u64 so that
btrfs_calc_trans_metadata_size() does a full 64-bit multiplication.
While we're here, do the same in btrfs_calc_trunc_metadata_size(); this
can't overflow with any existing uses, but it's better to be safe here
than have another hard-to-debug problem later on.
Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-06-02 01:20:01 -07:00
	return (u64)fs_info->nodesize * BTRFS_MAX_LEVEL * 2 * num_items;
2011-08-19 10:29:59 -04:00
}
/*
2019-08-22 15:14:33 -04:00
 * Doing a truncate or a modification won't result in new nodes or leaves, just
 * what we need for COW.
2011-08-19 10:29:59 -04:00
*/
2019-08-22 15:14:33 -04:00
static inline u64 btrfs_calc_metadata_size ( struct btrfs_fs_info * fs_info ,
2011-08-19 10:29:59 -04:00
unsigned num_items )
{
2017-06-02 01:20:01 -07:00
	return (u64)fs_info->nodesize * BTRFS_MAX_LEVEL * num_items;
2011-04-22 18:12:22 +08:00
}
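The two reservation helpers above differ only by a factor of two, and both rely on the u64 cast to avoid the 32-bit wrap described in the overflow commit message. A minimal user-space sketch of that arithmetic follows. `MAX_LEVEL_SKETCH` is assumed to equal BTRFS_MAX_LEVEL (8, which matches the message's figure of 16k items at a 16 KiB nodesize), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LEVEL_SKETCH 8u	/* assumption: same value as BTRFS_MAX_LEVEL */

/* Pre-fix behaviour: every operand is 32 bits wide, so the product wraps. */
static uint32_t insert_size_u32(uint32_t nodesize, unsigned num_items)
{
	assert(nodesize > 0);
	return nodesize * MAX_LEVEL_SKETCH * 2u * num_items;
}

/* Post-fix behaviour: widening the first operand makes the product 64-bit. */
static uint64_t insert_size_u64(uint32_t nodesize, unsigned num_items)
{
	assert(nodesize > 0);
	return (uint64_t)nodesize * MAX_LEVEL_SKETCH * 2u * num_items;
}

/* COW-only reservation: no factor of two, since no node splits are expected. */
static uint64_t cow_size_u64(uint32_t nodesize, unsigned num_items)
{
	assert(nodesize > 0);
	return (uint64_t)nodesize * MAX_LEVEL_SKETCH * num_items;
}
```

For a 16 KiB nodesize, each inserted item reserves 256 KiB, so 16384 items come to exactly 4 GiB: the 32-bit version wraps to 0 while the 64-bit version returns 4294967296.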
2019-06-20 15:37:49 -04:00
int btrfs_add_excluded_extent ( struct btrfs_fs_info * fs_info ,
u64 start , u64 num_bytes ) ;
2019-10-29 19:20:18 +01:00
void btrfs_free_excluded_extents ( struct btrfs_block_group * cache ) ;
2009-03-13 10:10:06 -04:00
int btrfs_run_delayed_refs ( struct btrfs_trans_handle * trans ,
2018-03-15 17:27:37 +02:00
unsigned long count ) ;
2018-11-21 14:05:41 -05:00
void btrfs_cleanup_ref_head_accounting ( struct btrfs_fs_info * fs_info ,
struct btrfs_delayed_ref_root * delayed_refs ,
struct btrfs_delayed_ref_head * head ) ;
2016-06-22 18:54:24 -04:00
int btrfs_lookup_data_extent ( struct btrfs_fs_info * fs_info , u64 start , u64 len ) ;
2010-05-16 10:48:46 -04:00
int btrfs_lookup_extent_info ( struct btrfs_trans_handle * trans ,
2016-06-22 18:54:24 -04:00
struct btrfs_fs_info * fs_info , u64 bytenr ,
2013-03-07 14:22:04 -05:00
u64 offset , int metadata , u64 * refs , u64 * flags ) ;
2020-01-20 16:09:09 +02:00
int btrfs_pin_extent ( struct btrfs_trans_handle * trans , u64 bytenr , u64 num ,
int reserved ) ;
2020-01-20 16:09:13 +02:00
int btrfs_pin_extent_for_log_replay ( struct btrfs_trans_handle * trans ,
2011-10-31 20:52:39 -04:00
u64 bytenr , u64 num_bytes ) ;
2019-03-20 12:14:33 +01:00
int btrfs_exclude_logged_extents ( struct extent_buffer * eb ) ;
2017-01-30 12:25:28 -08:00
int btrfs_cross_ref_exist ( struct btrfs_root * root ,
2020-08-18 11:00:05 -07:00
u64 objectid , u64 offset , u64 bytenr , bool strict ) ;
2014-06-15 01:54:12 +02:00
struct extent_buffer * btrfs_alloc_tree_block ( struct btrfs_trans_handle * trans ,
2017-01-17 23:24:37 -08:00
struct btrfs_root * root ,
u64 parent , u64 root_objectid ,
const struct btrfs_disk_key * key ,
int level , u64 hint ,
2020-08-20 11:46:03 -04:00
u64 empty_size ,
enum btrfs_lock_nesting nest ) ;
2010-05-16 10:46:25 -04:00
void btrfs_free_tree_block ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct extent_buffer * buf ,
2012-05-16 17:04:52 +02:00
u64 parent , int last_ref ) ;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
int btrfs_alloc_reserved_file_extent ( struct btrfs_trans_handle * trans ,
2017-09-29 15:43:49 -04:00
struct btrfs_root * root , u64 owner ,
2015-10-26 14:11:18 +08:00
u64 offset , u64 ram_bytes ,
struct btrfs_key * ins ) ;
2009-06-10 10:45:14 -04:00
int btrfs_alloc_logged_file_extent ( struct btrfs_trans_handle * trans ,
u64 root_objectid , u64 owner , u64 offset ,
struct btrfs_key * ins ) ;
btrfs: update btrfs_space_info's bytes_may_use timely
This patch can fix some false ENOSPC errors, below test script can
reproduce one false ENOSPC error:
#!/bin/bash
dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
dev=$(losetup --show -f fs.img)
mkfs.btrfs -f -M $dev
mkdir /tmp/mntpoint
mount $dev /tmp/mntpoint
cd /tmp/mntpoint
xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile
Above script will fail for ENOSPC reason, but indeed fs still has free
space to satisfy this request. Please see call graph:
btrfs_fallocate()
|-> btrfs_alloc_data_chunk_ondemand()
| bytes_may_use += 64M
|-> btrfs_prealloc_file_range()
|-> btrfs_reserve_extent()
|-> btrfs_add_reserved_bytes()
| alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
| change bytes_may_use, and bytes_reserved += 64M. Now
| bytes_may_use + bytes_reserved == 128M, which is greater
| than btrfs_space_info's total_bytes, false enospc occurs.
| Note, the bytes_may_use decrease operation will be done in
| end of btrfs_fallocate(), which is too late.
Here is another simple case for buffered write:
CPU 1 | CPU 2
|
|-> cow_file_range() |-> __btrfs_buffered_write()
|-> btrfs_reserve_extent() | |
| | |
| | |
| ..... | |-> btrfs_check_data_free_space()
| |
| |
|-> extent_clear_unlock_delalloc() |
In CPU 1, btrfs_reserve_extent()->find_free_extent()->
btrfs_add_reserved_bytes() do not decrease bytes_may_use, the decrease
operation will be delayed to be done in extent_clear_unlock_delalloc().
Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's
btrfs_check_data_free_space() tries to reserve 100MB data space.
If
100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
btrfs_check_data_free_space() will try to allocate new data chunk or call
btrfs_start_delalloc_roots(), or commit current transaction in order to
reserve some free space, obviously a lot of work. But indeed it's not
necessary as long as decreasing bytes_may_use timely, we still have
free space, decreasing 128M from bytes_may_use.
To fix this issue, this patch chooses to update bytes_may_use for both
data and metadata in btrfs_add_reserved_bytes(). For compress path, real
extent length may not be equal to file content length, so introduce a
ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
btrfs_add_reserved_bytes(), it's because bytes_may_use is increased by
file content length. Then compress path can update bytes_may_use
correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
and RESERVE_FREE.
As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In
run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as
PREALLOC, we also need to update bytes_may_use, but can not pass
EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so
here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook()
to update btrfs_space_info's bytes_may_use.
Meanwhile __btrfs_prealloc_file_range() will call
btrfs_free_reserved_data_space() internally for both successful and failed
path, btrfs_prealloc_file_range()'s callers do not need to call
btrfs_free_reserved_data_space() any more.
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
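The accounting problem described in this commit message can be modelled in a few lines. This is a deliberately simplified sketch: `struct space_info_sketch` and the helper names are hypothetical, pinned and read-only bytes are left out, and the real btrfs_add_reserved_bytes() does considerably more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced model of the btrfs_space_info counters. */
struct space_info_sketch {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_reserved;
	uint64_t bytes_may_use;
};

/* True when 'bytes' more can be promised without exceeding total_bytes. */
static bool can_reserve(const struct space_info_sketch *s, uint64_t bytes)
{
	uint64_t committed = s->bytes_used + s->bytes_reserved +
			     s->bytes_may_use;

	return committed + bytes <= s->total_bytes;
}

/* Old behaviour: reserve the extent but leave bytes_may_use untouched. */
static void add_reserved_old(struct space_info_sketch *s, uint64_t num_bytes)
{
	s->bytes_reserved += num_bytes;
}

/*
 * Behaviour after the patch: the promise moves from bytes_may_use to
 * bytes_reserved in one step. ram_bytes is the file content length that
 * was added to bytes_may_use; num_bytes is the extent actually reserved
 * (the two can differ on the compress path).
 */
static void add_reserved_new(struct space_info_sketch *s, uint64_t ram_bytes,
			     uint64_t num_bytes)
{
	assert(s->bytes_may_use >= ram_bytes);
	s->bytes_reserved += num_bytes;
	s->bytes_may_use -= ram_bytes;
}
```

With the old helper, the same data is counted in both bytes_may_use and bytes_reserved until the delayed decrement runs, so a concurrent reservation can be refused even though space is available. With the new helper the double counting never appears.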
2016-07-25 15:51:40 +08:00
int btrfs_reserve_extent ( struct btrfs_root * root , u64 ram_bytes , u64 num_bytes ,
2013-08-14 14:02:47 -04:00
u64 min_alloc_size , u64 empty_size , u64 hint_byte ,
Btrfs: fix broken free space cache after the system crashed
When we mounted the filesystem after the crash, we got the following
message:
BTRFS error (device xxx): block group xxxx has wrong amount of free space
BTRFS error (device xxx): failed to load free space cache for block group xxx
It is because we didn't update the metadata of the allocated space (in extent
tree) until the file data was written into the disk. During this time, there was
no information about the allocated spaces in either the extent tree or the
free space cache. When we wrote out the free space cache at this time (commit
transaction), those spaces were lost. In fact, only the free space that is
used to store the file data had this problem, the others didn't because
the metadata of them is updated in the same transaction context.
There are many methods which can fix the above problem
- track the allocated space, and write it out when we write out the free
space cache
- account the size of the allocated space that is used to store the file
data, if the size is not zero, don't write out the free space cache.
The first one is complex and may make the performance drop down.
This patch chose the second method, we use a per-block-group variant to
account the size of that allocated space. Besides that, we also introduce
a per-block-group read-write semaphore to avoid the race between
the allocation and the free space cache write out.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 10:42:50 +08:00
struct btrfs_key * ins , int is_data , int delalloc ) ;
2007-03-16 16:20:31 -04:00
int btrfs_inc_ref ( struct btrfs_trans_handle * trans , struct btrfs_root * root ,
2014-07-02 10:54:25 -07:00
struct extent_buffer * buf , int full_backref ) ;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
int btrfs_dec_ref ( struct btrfs_trans_handle * trans , struct btrfs_root * root ,
2014-07-02 10:54:25 -07:00
struct extent_buffer * buf , int full_backref ) ;
2009-06-10 10:45:14 -04:00
int btrfs_set_disk_extent_flags ( struct btrfs_trans_handle * trans ,
2019-03-20 11:54:13 +01:00
struct extent_buffer * eb , u64 flags ,
2013-05-09 13:49:30 -04:00
int level , int is_data ) ;
2019-04-04 14:45:36 +08:00
int btrfs_free_extent ( struct btrfs_trans_handle * trans , struct btrfs_ref * ref ) ;
2009-06-10 10:45:14 -04:00
2016-06-22 18:54:24 -04:00
int btrfs_free_reserved_extent ( struct btrfs_fs_info * fs_info ,
u64 start , u64 len , int delalloc ) ;
2020-01-20 16:09:12 +02:00
int btrfs_pin_reserved_extent ( struct btrfs_trans_handle * trans , u64 start ,
2019-11-21 14:03:31 +02:00
u64 len ) ;
2018-03-15 16:00:26 +02:00
int btrfs_finish_extent_commit ( struct btrfs_trans_handle * trans ) ;
2007-04-17 13:26:50 -04:00
int btrfs_inc_extent_ref ( struct btrfs_trans_handle * trans ,
2019-04-04 14:45:35 +08:00
struct btrfs_ref * generic_ref ) ;
2009-06-10 10:45:14 -04:00
2009-03-10 12:39:20 -04:00
void btrfs_clear_space_info_full ( struct btrfs_fs_info * info ) ;
Btrfs: improve the noflush reservation
In some places (such as evicting an inode), we cannot flush the reserved
space of delalloc. Flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just give up when there is
not enough space to be reserved. This patch fixes this problem.
We define 3 types of flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we are in a transaction, we should not flush anything, or a deadlock
would happen, so use NO_FLUSH. If flushing the reserved space of delalloc
would cause a deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we flush everything.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 11:33:38 +00:00
2020-07-13 09:03:22 +08:00
/*
* Different levels of space flushing when doing space reservations.
*
* The higher the level , the more methods we try to reclaim space .
*/
2012-10-16 11:33:38 +00:00
enum btrfs_reserve_flush_enum {
/* If we are in the transaction, we can't flush anything.*/
BTRFS_RESERVE_NO_FLUSH ,
2020-07-13 09:03:22 +08:00
2012-10-16 11:33:38 +00:00
/*
2020-07-13 09:03:22 +08:00
* Flush space by :
* - Running delayed inode items
* - Allocating a new chunk
2012-10-16 11:33:38 +00:00
*/
BTRFS_RESERVE_FLUSH_LIMIT ,
2020-07-13 09:03:22 +08:00
/*
* Flush space by :
* - Running delayed inode items
* - Running delayed refs
* - Running delalloc and waiting for ordered extents
* - Allocating a new chunk
*/
2019-08-01 18:19:37 -04:00
BTRFS_RESERVE_FLUSH_EVICT ,
2020-07-13 09:03:22 +08:00
/*
* Flush space by above mentioned methods and by :
* - Running delayed iputs
* - Committing transaction
*
* Can be interrupted by a fatal signal.
*/
2020-07-21 10:22:23 -04:00
BTRFS_RESERVE_FLUSH_DATA ,
BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE ,
2012-10-16 11:33:38 +00:00
BTRFS_RESERVE_FLUSH_ALL ,
2020-07-13 09:03:22 +08:00
/*
* Pretty much the same as FLUSH_ALL , but can also steal space from
* global rsv .
*
* Can be interrupted by a fatal signal.
*/
2020-03-13 15:58:05 -04:00
BTRFS_RESERVE_FLUSH_ALL_STEAL ,
2012-10-16 11:33:38 +00:00
} ;
2016-03-25 13:25:56 -04:00
enum btrfs_flush_state {
FLUSH_DELAYED_ITEMS_NR = 1 ,
FLUSH_DELAYED_ITEMS = 2 ,
2018-12-03 10:20:35 -05:00
FLUSH_DELAYED_REFS_NR = 3 ,
FLUSH_DELAYED_REFS = 4 ,
FLUSH_DELALLOC = 5 ,
FLUSH_DELALLOC_WAIT = 6 ,
ALLOC_CHUNK = 7 ,
2018-11-21 14:03:08 -05:00
ALLOC_CHUNK_FORCE = 8 ,
2019-08-01 18:19:33 -04:00
RUN_DELAYED_IPUTS = 9 ,
COMMIT_TRANS = 10 ,
2020-10-09 09:28:21 -04:00
FORCE_COMMIT_TRANS = 11 ,
2016-03-25 13:25:56 -04:00
} ;
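The comments on the reservation levels above describe a cumulative scheme: each higher level tries everything the lower ones do plus a few more methods. A rough sketch of that reading follows. Every name in it (`enum sketch_level`, `enum sketch_method`, `sketch_methods()`) is hypothetical and exists only for this illustration. It is derived from the comments alone, leaves out `BTRFS_RESERVE_FLUSH_DATA` and `BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE`, and does not claim to match how the space reservation code actually walks `enum btrfs_flush_state`.

```c
#include <assert.h>

/* Hypothetical mirror of the documented reservation levels. */
enum sketch_level {
	SKETCH_NO_FLUSH,
	SKETCH_FLUSH_LIMIT,
	SKETCH_FLUSH_EVICT,
	SKETCH_FLUSH_ALL,
	SKETCH_FLUSH_ALL_STEAL,
};

/* Hypothetical method bits, one per bullet point in the comments. */
enum sketch_method {
	M_DELAYED_ITEMS = 1 << 0,	/* running delayed inode items */
	M_DELAYED_REFS  = 1 << 1,	/* running delayed refs */
	M_DELALLOC      = 1 << 2,	/* delalloc + waiting for ordered extents */
	M_ALLOC_CHUNK   = 1 << 3,	/* allocating a new chunk */
	M_DELAYED_IPUTS = 1 << 4,	/* running delayed iputs */
	M_COMMIT_TRANS  = 1 << 5,	/* committing the transaction */
	M_STEAL_GLOBAL  = 1 << 6,	/* stealing from the global rsv */
};

/* Return the set of reclaim methods a level may use, per the comments. */
static unsigned int sketch_methods(enum sketch_level level)
{
	unsigned int m = 0;

	assert(level >= SKETCH_NO_FLUSH && level <= SKETCH_FLUSH_ALL_STEAL);

	switch (level) {
	case SKETCH_FLUSH_ALL_STEAL:
		m |= M_STEAL_GLOBAL;
		/* fall through */
	case SKETCH_FLUSH_ALL:
		m |= M_DELAYED_IPUTS | M_COMMIT_TRANS;
		/* fall through */
	case SKETCH_FLUSH_EVICT:
		m |= M_DELAYED_REFS | M_DELALLOC;
		/* fall through */
	case SKETCH_FLUSH_LIMIT:
		m |= M_DELAYED_ITEMS | M_ALLOC_CHUNK;
		/* fall through */
	case SKETCH_NO_FLUSH:
		break;
	}
	return m;
}
```

The fall-through switch makes the "higher level, more methods" property structural: a level's set is always a superset of the set of the level below it.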
2013-02-28 10:04:33 +00:00
int btrfs_subvolume_reserve_metadata ( struct btrfs_root * root ,
struct btrfs_block_rsv * rsv ,
2018-05-30 11:00:38 +08:00
int nitems , bool use_global_rsv ) ;
btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations
[BUG]
When quota is enabled for TEST_DEV, generic/013 sometimes fails like this:
generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg)
And with the following metadata leak:
BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152
------------[ cut here ]------------
WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs]
Call Trace:
btrfs_put_super+0x15/0x17 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x17/0x30 [btrfs]
deactivate_locked_super+0x3b/0xa0
deactivate_super+0x40/0x50
cleanup_mnt+0x135/0x190
__cleanup_mnt+0x12/0x20
task_work_run+0x64/0xb0
__prepare_exit_to_usermode+0x1bc/0x1c0
__syscall_return_slowpath+0x47/0x230
do_syscall_64+0x64/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace a6cfd45ba80e4e06 ]---
BTRFS error (device dm-3): qgroup reserved space leaked
BTRFS info (device dm-3): disk space caching is enabled
BTRFS info (device dm-3): has skinny extents
[CAUSE]
The qgroup preallocated meta rsv operations of that offending root are:
btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152
btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
Clearly, we reserve qgroup meta rsv in
btrfs_subvolume_reserve_metadata(), but there are no corresponding
release/convert calls in btrfs_subvolume_release_metadata().
This leads to the leak.
[FIX]
To fix this bug, we should follow what we're doing in
btrfs_delalloc_reserve_metadata(), where we reserve qgroup space, and
add it to block_rsv->qgroup_rsv_reserved.
And free the qgroup reserved metadata space when releasing the
block_rsv.
To do this, we need to change the btrfs_subvolume_release_metadata() to
accept btrfs_root, and record the qgroup_to_release number, and call
btrfs_qgroup_convert_reserved_meta() for it.
Fixes: 733e03a0b26a ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-24 14:46:10 +08:00
void btrfs_subvolume_release_metadata ( struct btrfs_root * root ,
2017-02-10 19:18:18 +01:00
struct btrfs_block_rsv * rsv ) ;
btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents()
[Background]
Btrfs qgroup uses two types of reserved space for METADATA space,
PERTRANS and PREALLOC.
PERTRANS is metadata space reserved for each transaction started by
btrfs_start_transaction().
While PREALLOC is for delalloc, where we reserve space before joining a
transaction, and finally it will be converted to PERTRANS after the
writeback is done.
[Inconsistency]
However there is inconsistency in how we handle PREALLOC metadata space.
The most obvious one is:
In btrfs_buffered_write():
btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
We always free qgroup PREALLOC meta space.
While in btrfs_truncate_block():
btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
We only free qgroup PREALLOC meta space when something went wrong.
[The Correct Behavior]
The correct behavior should be the one in btrfs_buffered_write(), we
should always free PREALLOC metadata space.
The reason is, the btrfs_delalloc_* mechanism works by:
- Reserve metadata first, even if it's not necessary
In btrfs_delalloc_reserve_metadata()
- Free the unused metadata space
Normally in:
btrfs_delalloc_release_extents()
|- btrfs_inode_rsv_release()
Here we do calculation on whether we should release or not.
E.g. for 64K buffered write, the metadata rsv works like:
/* The first page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=0
total: num_bytes=calc_inode_reservations()
/* The first page caused one outstanding extent, thus needs metadata
rsv */
/* The 2nd page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed
/* The 2nd page doesn't cause new outstanding extent, needs no new meta
rsv, so we free what we have reserved */
/* The 3rd~16th pages */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed (still space for one outstanding extent)
This means that if btrfs_delalloc_release_extents() decides to free some
space, then that space should be freed NOW.
So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() rather
than btrfs_qgroup_convert_reserved_meta().
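The reserve-then-release pattern from the 64K buffered write example can be modelled in a few lines. This is a sketch under assumptions, not the kernel logic: `struct sketch_inode_rsv`, `SKETCH_RSV_PER_EXTENT` (an arbitrary stand-in for the result of calc_inode_reservations()) and `sketch_write_page()` are hypothetical names, and the model keeps exactly one unit of reservation per outstanding extent.

```c
#include <assert.h>
#include <stdint.h>

/* Arbitrary stand-in for what calc_inode_reservations() would return. */
#define SKETCH_RSV_PER_EXTENT 16384ULL

/* Hypothetical per-inode reservation state. */
struct sketch_inode_rsv {
	uint64_t reserved;			/* metadata bytes currently held */
	unsigned int outstanding_extents;	/* extents that still need metadata */
};

/*
 * Model one page of a buffered write. 'new_extent' is nonzero when the
 * page creates a new outstanding extent. Returns the number of bytes
 * released immediately.
 */
static uint64_t sketch_write_page(struct sketch_inode_rsv *rsv, int new_extent)
{
	uint64_t needed;
	uint64_t freed = 0;

	assert(rsv != 0);

	/* Reserve first, even if it turns out not to be necessary. */
	rsv->reserved += SKETCH_RSV_PER_EXTENT;
	if (new_extent)
		rsv->outstanding_extents++;

	/* Release step: keep what the outstanding extents need, free the rest now. */
	needed = (uint64_t)rsv->outstanding_extents * SKETCH_RSV_PER_EXTENT;
	if (rsv->reserved > needed) {
		freed = rsv->reserved - needed;
		rsv->reserved = needed;
	}
	return freed;
}
```

Feeding 16 pages through this model, with only the first one creating an extent, frees nothing for the first page and one full reservation for each of the remaining 15, which is the accounting the example in the commit message walks through.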
The good news is:
- The callers are not that hot
The hottest caller is in btrfs_buffered_write(), which is already
fixed by commit 336a8bb8e36a ("btrfs: Fix wrong
btrfs_delalloc_release_extents parameter"). Thus it's not that
easy to cause false EDQUOT.
- The trans commit in advance for qgroup would hide the bug
Since commit f5fef4593653 ("btrfs: qgroup: Make qgroup async transaction
commit more aggressive"), when btrfs qgroup metadata free space is low,
it will try to commit transaction and free the wrongly converted
PERTRANS space, so it's not that easy to hit such bug.
[FIX]
So to fix the problem, remove the @qgroup_free parameter for
btrfs_delalloc_release_extents(), and always pass true to
btrfs_inode_rsv_release().
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 43b18595d660 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-14 14:34:51 +08:00
void btrfs_delalloc_release_extents ( struct btrfs_inode * inode , u64 num_bytes ) ;
2017-10-19 14:15:55 -04:00
2017-02-20 13:50:41 +02:00
int btrfs_delalloc_reserve_metadata ( struct btrfs_inode * inode , u64 num_bytes ) ;
btrfs: fix wrong free space information of btrfs
When we store data by raid profile in btrfs with two or more different size
disks, df command shows there is some free space in the filesystem, but the
user can not write any data in fact, df command shows the wrong free space
information of btrfs.
# mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 28.00KB
devid 1 size 5.01GB used 2.03GB path /dev/sda9
devid 2 size 10.00GB used 2.01GB path /dev/sda10
# btrfs device scan /dev/sda9 /dev/sda10
# mount /dev/sda9 /mnt
# dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
(fill the filesystem)
# sync
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 3.99GB
devid 1 size 5.01GB used 5.01GB path /dev/sda9
devid 2 size 10.00GB used 4.99GB path /dev/sda10
This is because btrfs cannot allocate chunks when one of the paired disks has
no space. The free space on the other disks can never be used and should
be subtracted from the total space, but btrfs doesn't subtract it from
the total, which is confusing to the user.
This patch fixes it by calculating the free space that can be used to allocate
chunks.
Implementation:
1. get all the devices free space, and align them by stripe length.
2. sort the devices by the free space.
3. check the free space of the devices,
3.1. if it is not zero, and then check the number of the devices that has
more free space than this device,
if the number of the devices is beyond the min stripe number, the free
space can be used, and add into total free space.
if the number of the devices is below the min stripe number, we can not
use the free space, the check ends.
3.2. if the free space is zero, check the next devices, goto 3.1
This implementation is essentially a fake chunk allocation.
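The steps listed above can be sketched as a stand-alone function. This is a simplified model, not the code from the patch: `usable_raw_space()` and `cmp_free_desc()` are hypothetical names, the caller passes the minimum and preferred stripe counts explicitly, and the result is raw bytes across all stripes (for a mirrored profile the data capacity would be a fraction of it).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator: largest free space first. */
static int cmp_free_desc(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x < y) - (x > y);
}

/*
 * Hypothetical helper: estimate how much raw space could still be turned
 * into chunks. free_space[] is modified in place.
 */
static uint64_t usable_raw_space(uint64_t *free_space, size_t nr_devices,
				 size_t min_stripes, size_t num_stripes,
				 uint64_t stripe_len)
{
	uint64_t total = 0;
	size_t i;

	assert(stripe_len > 0);
	assert(min_stripes > 0 && num_stripes >= min_stripes);

	/* Step 1: align every device's free space down to the stripe length. */
	for (i = 0; i < nr_devices; i++)
		free_space[i] -= free_space[i] % stripe_len;

	/* Step 2: sort the devices by free space. */
	qsort(free_space, nr_devices, sizeof(*free_space), cmp_free_desc);

	/*
	 * Step 3: start at the device with the least free space. Its space
	 * is usable only while enough devices with at least as much space
	 * remain to form a stripe set; "allocate" it and move on.
	 */
	while (nr_devices >= min_stripes) {
		size_t last = nr_devices - 1;
		size_t width = num_stripes < nr_devices ? num_stripes : nr_devices;
		uint64_t size = free_space[last];

		if (size > 0) {
			size_t j;

			total += size * width;
			for (j = nr_devices - width; j <= last; j++)
				free_space[j] -= size;
		}
		nr_devices--;
	}
	return total;
}
```

With two devices, a two-stripe profile and one device already full, the loop finds nothing to pair the remaining space with and returns 0, which corresponds to the corrected `df` output reporting no available space.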
After applying this patch, df can show correct space information:
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 0 100% /mnt
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-05 10:07:31 +00:00
u64 btrfs_account_ro_block_groups_free_space ( struct btrfs_space_info * sinfo ) ;
2016-06-22 18:54:24 -04:00
int btrfs_error_unpin_extent_range ( struct btrfs_fs_info * fs_info ,
2011-01-06 19:30:25 +08:00
u64 start , u64 end ) ;
2016-06-22 18:54:24 -04:00
int btrfs_discard_extent ( struct btrfs_fs_info * fs_info , u64 bytenr ,
2014-12-08 14:01:12 +00:00
u64 num_bytes , u64 * actual_bytes ) ;
2016-06-22 18:54:24 -04:00
int btrfs_trim_fs ( struct btrfs_fs_info * fs_info , struct fstrim_range * range ) ;
2011-01-06 19:30:25 +08:00
2011-03-07 02:13:14 +00:00
int btrfs_init_space_info ( struct btrfs_fs_info * fs_info ) ;
2012-06-28 18:03:02 +02:00
int btrfs_delayed_refs_qgroup_accounting ( struct btrfs_trans_handle * trans ,
struct btrfs_fs_info * fs_info ) ;
2017-06-22 02:19:11 +02:00
int btrfs_start_write_no_snapshotting ( struct btrfs_root * root ) ;
void btrfs_end_write_no_snapshotting ( struct btrfs_root * root ) ;
2016-01-06 18:56:36 +08:00
void btrfs_wait_for_snapshot_creation ( struct btrfs_root * root ) ;
2015-09-29 20:50:35 -07:00
2007-03-26 16:00:06 -04:00
/* ctree.c */
2017-01-17 23:24:37 -08:00
int btrfs_bin_search ( struct extent_buffer * eb , const struct btrfs_key * key ,
2020-04-17 15:08:21 +08:00
int * slot ) ;
2019-10-01 19:57:39 +02:00
int __pure btrfs_comp_cpu_keys ( const struct btrfs_key * k1 , const struct btrfs_key * k2 ) ;
2008-03-24 15:01:56 -04:00
int btrfs_previous_item ( struct btrfs_root * root ,
struct btrfs_path * path , u64 min_objectid ,
int type ) ;
2014-01-12 21:38:33 +08:00
int btrfs_previous_extent_item ( struct btrfs_root * root ,
struct btrfs_path * path , u64 min_objectid ) ;
2014-11-12 13:43:09 +09:00
void btrfs_set_item_key_safe ( struct btrfs_fs_info * fs_info ,
struct btrfs_path * path ,
2017-01-17 23:24:37 -08:00
const struct btrfs_key * new_key ) ;
2008-06-25 16:01:30 -04:00
struct extent_buffer * btrfs_root_node ( struct btrfs_root * root ) ;
2008-06-25 16:01:31 -04:00
int btrfs_find_next_key ( struct btrfs_root * root , struct btrfs_path * path ,
2008-06-25 16:01:31 -04:00
struct btrfs_key * key , int lowest_level ,
2013-01-31 18:21:12 +00:00
u64 min_trans ) ;
2008-06-25 16:01:31 -04:00
int btrfs_search_forward ( struct btrfs_root * root , struct btrfs_key * min_key ,
2013-01-31 18:21:12 +00:00
struct btrfs_path * path ,
2008-06-25 16:01:31 -04:00
u64 min_trans ) ;
2019-08-21 19:16:27 +02:00
struct extent_buffer * btrfs_read_node_slot ( struct extent_buffer * parent ,
int slot ) ;
2007-10-15 16:14:19 -04:00
int btrfs_cow_block ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root , struct extent_buffer * buf ,
struct extent_buffer * parent , int parent_slot ,
2020-08-20 11:46:03 -04:00
struct extent_buffer * * cow_ret ,
enum btrfs_lock_nesting nest ) ;
2007-12-17 20:14:01 -05:00
int btrfs_copy_root ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct extent_buffer * buf ,
struct extent_buffer * * cow_ret , u64 new_root_objectid ) ;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
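The selection rule described above reduces to a single comparison. The following is a minimal sketch, assuming the number of roots that reference a block is already known; the enum and function names are hypothetical and do not appear in btrfs.

```c
#include <assert.h>

/* Hypothetical labels for the two kinds of back reference. */
enum toy_backref_kind {
	TOY_BACKREF_FUZZY, /* key, level and owner tree; resolved by searching */
	TOY_BACKREF_FULL   /* records the exact location of the referencing block */
};

/*
 * Pick the back reference kind for a tree block from the number of
 * roots that hold a reference on it.
 */
static enum toy_backref_kind toy_pick_backref(unsigned int nr_roots)
{
	assert(nr_roots >= 1);
	return nr_roots == 1 ? TOY_BACKREF_FUZZY : TOY_BACKREF_FULL;
}
```

The cheaper fuzzy form covers the common single-root case, while blocks shared by snapshots fall back to the full form to avoid the O(number_of_snapshots) lookup.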
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
int btrfs_block_can_be_shared ( struct btrfs_root * root ,
struct extent_buffer * buf ) ;
2019-03-20 14:51:10 +01:00
void btrfs_extend_item ( struct btrfs_path * path , u32 data_size ) ;
2019-03-20 14:49:12 +01:00
void btrfs_truncate_item ( struct btrfs_path * path , u32 new_size , int from_end ) ;
2008-12-10 09:10:46 -05:00
int btrfs_split_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path ,
2017-01-17 23:24:37 -08:00
const struct btrfs_key * new_key ,
2008-12-10 09:10:46 -05:00
unsigned long split_offset ) ;
2009-11-12 09:33:58 +00:00
int btrfs_duplicate_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path ,
2017-01-17 23:24:37 -08:00
const struct btrfs_key * new_key ) ;
2013-11-04 19:33:33 -08:00
int btrfs_find_item ( struct btrfs_root * fs_root , struct btrfs_path * path ,
u64 inum , u64 ioff , u8 key_type , struct btrfs_key * found_key ) ;
2017-01-17 23:24:37 -08:00
int btrfs_search_slot ( struct btrfs_trans_handle * trans , struct btrfs_root * root ,
const struct btrfs_key * key , struct btrfs_path * p ,
int ins_len , int cow ) ;
int btrfs_search_old_slot ( struct btrfs_root * root , const struct btrfs_key * key ,
2012-05-16 18:25:47 +02:00
struct btrfs_path * p , u64 time_seq ) ;
2011-09-13 11:18:10 +02:00
int btrfs_search_slot_for_read ( struct btrfs_root * root ,
2017-01-17 23:24:37 -08:00
const struct btrfs_key * key ,
struct btrfs_path * p , int find_higher ,
int return_any ) ;
2007-08-07 16:15:09 -04:00
int btrfs_realloc_node ( struct btrfs_trans_handle * trans ,
2007-10-15 16:14:19 -04:00
struct btrfs_root * root , struct extent_buffer * parent ,
2013-01-31 18:21:12 +00:00
int start_slot , u64 * last_ret ,
2007-10-15 16:22:39 -04:00
struct btrfs_key * progress ) ;
2011-04-21 01:20:15 +02:00
void btrfs_release_path ( struct btrfs_path * p ) ;
2007-04-02 10:50:19 -04:00
struct btrfs_path * btrfs_alloc_path ( void ) ;
void btrfs_free_path ( struct btrfs_path * p ) ;
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop.
Most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able to get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks;
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
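The state transitions listed above can be modelled as a small single-threaded state machine. This is a sketch for illustration only: it uses two booleans instead of a real spinlock and wait queue, and every `toy_` name is hypothetical. Where the real code would spin or sleep, the model simply reports failure.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one extent buffer lock. */
struct toy_eb_lock {
	bool spin_held; /* models the spinlock being held */
	bool blocking;  /* models the EXTENT_BUFFER_BLOCKING bit */
};

/* The buffer counts as locked in either the spinning or the blocking state. */
static bool toy_is_locked(const struct toy_eb_lock *l)
{
	return l->spin_held || l->blocking;
}

/*
 * On success, returns with the spinlock held. The real code would spin
 * or wait for the blocking bit to clear; the model returns false instead.
 */
static bool toy_try_tree_lock(struct toy_eb_lock *l)
{
	if (toy_is_locked(l))
		return false;
	l->spin_held = true;
	return true;
}

/* Set the blocking bit, then drop the spinlock; the buffer stays locked. */
static void toy_set_lock_blocking(struct toy_eb_lock *l)
{
	assert(l->spin_held);
	l->blocking = true;
	l->spin_held = false;
}

/* Clear the blocking bit and return with the spinlock held again. */
static void toy_clear_lock_blocking(struct toy_eb_lock *l)
{
	assert(l->blocking);
	l->spin_held = true;
	l->blocking = false;
}

/* Unlock from either state, choosing the path based on the blocking bit. */
static void toy_tree_unlock(struct toy_eb_lock *l)
{
	if (l->blocking) {
		l->blocking = false;
	} else {
		assert(l->spin_held);
		l->spin_held = false;
	}
}
```

A holder that is about to schedule calls `toy_set_lock_blocking()` first; other lockers then see the buffer as locked even though the spinlock itself is free.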
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:25:08 -05:00
2008-01-29 15:11:36 -05:00
int btrfs_del_items ( struct btrfs_trans_handle * trans , struct btrfs_root * root ,
struct btrfs_path * path , int slot , int nr ) ;
static inline int btrfs_del_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path )
{
return btrfs_del_items ( trans , root , path , path -> slots [ 0 ] , 1 ) ;
}
2013-04-16 05:18:22 +00:00
void setup_items_for_insert ( struct btrfs_root * root , struct btrfs_path * path ,
2017-01-17 23:24:37 -08:00
const struct btrfs_key * cpu_key , u32 * data_size ,
2020-09-01 17:39:59 +03:00
int nr ) ;
2017-01-17 23:24:37 -08:00
int btrfs_insert_item ( struct btrfs_trans_handle * trans , struct btrfs_root * root ,
const struct btrfs_key * key , void * data , u32 data_size ) ;
2008-01-29 15:15:18 -05:00
int btrfs_insert_empty_items ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path ,
2017-01-17 23:24:37 -08:00
const struct btrfs_key * cpu_key , u32 * data_size ,
int nr ) ;
2008-01-29 15:15:18 -05:00
static inline int btrfs_insert_empty_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path ,
2017-01-17 23:24:37 -08:00
const struct btrfs_key * key ,
2008-01-29 15:15:18 -05:00
u32 data_size )
{
return btrfs_insert_empty_items ( trans , root , path , key , & data_size , 1 ) ;
}
2007-03-13 10:46:10 -04:00
int btrfs_next_leaf ( struct btrfs_root * root , struct btrfs_path * path ) ;
2013-10-22 12:18:51 -04:00
int btrfs_prev_leaf ( struct btrfs_root * root , struct btrfs_path * path ) ;
2012-06-11 08:29:29 +02:00
int btrfs_next_old_leaf ( struct btrfs_root * root , struct btrfs_path * path ,
u64 time_seq ) ;
2012-06-19 07:42:25 -06:00
static inline int btrfs_next_old_item ( struct btrfs_root * root ,
struct btrfs_path * p , u64 time_seq )
2011-11-22 15:14:33 +01:00
{
++ p -> slots [ 0 ] ;
if ( p -> slots [ 0 ] >= btrfs_header_nritems ( p -> nodes [ 0 ] ) )
2012-06-19 07:42:25 -06:00
return btrfs_next_old_leaf ( root , p , time_seq ) ;
2011-11-22 15:14:33 +01:00
return 0 ;
}
2012-06-19 07:42:25 -06:00
static inline int btrfs_next_item ( struct btrfs_root * root , struct btrfs_path * p )
{
return btrfs_next_old_item ( root , p , 0 ) ;
}
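The slot-advance logic in btrfs_next_old_item() can be tried out on a toy structure. The sketch below is an assumption-laden model: leaves are a flat array, `toy_leaf`, `toy_path`, `toy_next_leaf()` and `toy_next_item()` are hypothetical names, and returning 1 when no leaf remains is a convention chosen for this example.

```c
#include <assert.h>

/* Hypothetical leaf: only the item count matters for this model. */
struct toy_leaf {
	int nritems;
};

/* Hypothetical path: the current leaf and the slot inside that leaf. */
struct toy_path {
	const struct toy_leaf *leaves;
	int nr_leaves;
	int leaf;
	int slot;
};

/* Move to the first slot of the next leaf; return 1 if there is none. */
static int toy_next_leaf(struct toy_path *p)
{
	if (p->leaf + 1 >= p->nr_leaves)
		return 1;
	p->leaf++;
	p->slot = 0;
	return 0;
}

/*
 * Same shape as the inline helper above: bump the slot, and only fall
 * back to the more expensive leaf step once the slot runs past the end.
 */
static int toy_next_item(struct toy_path *p)
{
	++p->slot;
	if (p->slot >= p->leaves[p->leaf].nritems)
		return toy_next_leaf(p);
	return 0;
}
```

Most calls stay within the current leaf and cost a single increment and comparison, which is presumably why the helper is declared static inline.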
2019-03-20 14:36:46 +01:00
int btrfs_leaf_free_space ( struct extent_buffer * leaf ) ;
2020-03-10 11:43:51 +02:00
int __must_check btrfs_drop_snapshot ( struct btrfs_root * root , int update_ref ,
int for_reloc ) ;
2008-10-29 14:49:05 -04:00
int btrfs_drop_subtree ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct extent_buffer * node ,
struct extent_buffer * parent ) ;
2011-05-31 18:07:27 +02:00
static inline int btrfs_fs_closing ( struct btrfs_fs_info * fs_info )
{
/*
2016-09-02 15:40:02 -04:00
* Do it this way so we only ever do one test_bit in the normal case.
2011-05-31 18:07:27 +02:00
*/
2016-09-02 15:40:02 -04:00
if ( test_bit ( BTRFS_FS_CLOSING_START , & fs_info -> flags ) ) {
if ( test_bit ( BTRFS_FS_CLOSING_DONE , & fs_info -> flags ) )
return 2 ;
return 1 ;
}
return 0 ;
2011-05-31 18:07:27 +02:00
}
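The three-way result of btrfs_fs_closing() can be reproduced in a self-contained sketch. The bit numbers and the `toy_test_bit()` helper below are assumptions made for this example; the kernel's test_bit() and the real BTRFS_FS_* values are not used.

```c
#include <assert.h>

/* Hypothetical bit positions; the real flag values are not shown here. */
enum {
	TOY_FS_CLOSING_START = 0,
	TOY_FS_CLOSING_DONE = 1
};

/* Minimal single-word stand-in for a bit test on a flags word. */
static int toy_test_bit(int nr, const unsigned long *flags)
{
	return (int)((*flags >> nr) & 1UL);
}

/*
 * 0: not closing, 1: closing has started, 2: closing is done.
 * The DONE bit is only examined once START is set, so the common
 * "not closing" case costs a single bit test.
 */
static int toy_fs_closing(const unsigned long *flags)
{
	if (toy_test_bit(TOY_FS_CLOSING_START, flags)) {
		if (toy_test_bit(TOY_FS_CLOSING_DONE, flags))
			return 2;
		return 1;
	}
	return 0;
}
```

Because of the nesting, a DONE bit without a START bit still reports 0 in this model.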
2013-05-14 10:20:43 +00:00
/*
* If we remount the fs to be R/O or umount the fs, the cleaner needn't do
* anything except sleeping. This function is used to check the status of
* the fs.
btrfs: fix race between RO remount and the cleaner task
When we are remounting a filesystem in RO mode we can race with the cleaner
task and result in leaking a transaction if the filesystem is unmounted
shortly after, before the transaction kthread had a chance to commit that
transaction. That also results in a crash during unmount, due to a
use-after-free, if hardware acceleration is not available for crc32c.
The following sequence of steps explains how the race happens.
1) The filesystem is mounted in RW mode and the cleaner task is running.
This means that currently BTRFS_FS_CLEANER_RUNNING is set at
fs_info->flags;
2) The cleaner task is currently running delayed iputs, for example;
3) A filesystem RO remount operation starts;
4) The RO remount task calls btrfs_commit_super(), which commits any
currently open transaction, and it finishes;
5) At this point the cleaner task is still running and it creates a new
transaction by doing one of the following things:
* When running the delayed iput() for an inode with a 0 link count,
in which case at btrfs_evict_inode() we start a transaction through
the call to evict_refill_and_join(), use it and then release its
handle through btrfs_end_transaction();
* When deleting a dead root through btrfs_clean_one_deleted_snapshot(),
a transaction is started at btrfs_drop_snapshot() and then its handle
is released through a call to btrfs_end_transaction_throttle();
* When the remount task was still running, and before the remount task
called btrfs_delete_unused_bgs(), the cleaner task also called
btrfs_delete_unused_bgs() and it picked and removed one block group
from the list of unused block groups. Before the cleaner task started
a transaction, through btrfs_start_trans_remove_block_group() at
btrfs_delete_unused_bgs(), the remount task had already called
btrfs_commit_super();
6) So at this point the filesystem is in RO mode and we have an open
transaction that was started by the cleaner task;
7) Shortly after, a filesystem unmount operation starts. At close_ctree()
we stop the transaction kthread before it had a chance to commit the
transaction, since less than 30 seconds (the default commit interval)
have elapsed since the last transaction was committed;
8) We end up calling iput() against the btree inode at close_ctree() while
there is an open transaction, and since that transaction was used to
update btrees by the cleaner, we have dirty pages in the btree inode
due to COW operations on metadata extents, and therefore writeback is
triggered for the btree inode.
So btree_write_cache_pages() is invoked to flush those dirty pages
during the final iput() on the btree inode. This results in creating a
bio and submitting it, which makes us end up at
btrfs_submit_metadata_bio();
9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
that calls btrfs_wq_submit_bio(), because check_async_write() returned
a value of 1. This value of 1 is because we did not have hardware
acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
set in fs_info->flags;
10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
workqueue at fs_info->workers, which was already freed before by the
call to btrfs_stop_all_workers() at close_ctree(). This results in an
invalid memory access due to a use-after-free, leading to a crash.
When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:
------------[ cut here ]------------
WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
Code: f0 01 00 00 48 39 c2 75 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 48 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c6 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Code: 48 83 bb b0 03 00 00 00 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c7 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Code: ad de 49 be 22 01 00 (...)
RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c8 ]---
BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:
stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
Code: 54 55 53 48 89 f3 (...)
RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
submit_one_bio+0x61/0x70 [btrfs]
btree_write_cache_pages+0x414/0x450 [btrfs]
? kobject_put+0x9a/0x1d0
? trace_hardirqs_on+0x1b/0xf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
? free_debug_processing+0x1e1/0x2b0
do_writepages+0x43/0xe0
? lock_acquired+0x199/0x490
__writeback_single_inode+0x59/0x650
writeback_single_inode+0xaf/0x120
write_inode_now+0x94/0xd0
iput+0x187/0x2b0
close_ctree+0x2c6/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3cfebabee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
---[ end trace dd74718fef1ed5cc ]---
Finally when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, a consequence of the transaction that was never committed and got
leaked:
=============================================================================
BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x0000000050cbdd61 @offset=12104
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x0000000086e9b0ff @offset=12776
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
commit_cowonly_roots+0x248/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000001a340018 @offset=4408
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_commit_transaction+0x60/0xc40 [btrfs]
create_subvol+0x56a/0x990 [btrfs]
btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
btrfs_ioctl+0x1a92/0x36f0 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x000000002b46292a @offset=13648
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? __mutex_unlock_slowpath+0x45/0x2a0
kmem_cache_destroy+0x55/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000004cf95ea8 @offset=6264
INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1
So fix this by making the remount path wait for the cleaner task before
calling btrfs_commit_super(). The remount path now waits for the bit
BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling
btrfs_commit_super(), which ensures the cleaner cannot start a
transaction after that, because it sleeps when the filesystem is in RO
mode and we have already flagged the filesystem as RO before waiting for
BTRFS_FS_CLEANER_RUNNING to be cleared.
This also introduces a new flag BTRFS_FS_STATE_RO to be used for
fs_info->fs_state when the filesystem is in RO mode. This is because we
were doing the RO check using the flags of the superblock and setting the
RO mode simply by ORing into the superblock's flags - those operations are
not atomic and could result in the cleaner not seeing the update from the
remount task after it clears BTRFS_FS_CLEANER_RUNNING.
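As a rough illustration of why a dedicated, atomically updated state bit helps, the following is a minimal userspace sketch using C11 atomics. It is not the kernel code: the struct, the helper names, and the bit numbers are all hypothetical stand-ins for fs_info->fs_state, fs_info->flags, and the kernel's bit operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; names and bit numbers are illustrative,
 * not the kernel's definitions. */
enum model_bits {
	MODEL_FS_STATE_RO = 0,        /* stands in for BTRFS_FS_STATE_RO */
	MODEL_FS_CLEANER_RUNNING = 1, /* stands in for BTRFS_FS_CLEANER_RUNNING */
};

struct fs_info_model {
	atomic_ulong fs_state; /* models fs_info->fs_state */
	atomic_ulong flags;    /* models fs_info->flags */
};

/* Atomic read-modify-write: concurrent updates to other bits in the
 * same word are not lost, unlike a plain "word |= mask". */
static inline void model_set_bit(int nr, atomic_ulong *word)
{
	assert(nr >= 0 && nr < 32);
	atomic_fetch_or(word, 1UL << nr);
}

static inline void model_clear_bit(int nr, atomic_ulong *word)
{
	assert(nr >= 0 && nr < 32);
	atomic_fetch_and(word, ~(1UL << nr));
}

static inline bool model_test_bit(int nr, atomic_ulong *word)
{
	assert(nr >= 0 && nr < 32);
	return (atomic_load(word) >> nr) & 1UL;
}

/* Cleaner side: go back to sleep once the RO state bit is set. */
static inline bool model_need_cleaner_sleep(struct fs_info_model *fs)
{
	return model_test_bit(MODEL_FS_STATE_RO, &fs->fs_state);
}

/* Remount side: committing is safe only after RO is flagged and the
 * cleaner has cleared its running bit. */
static inline bool model_remount_may_commit(struct fs_info_model *fs)
{
	return model_test_bit(MODEL_FS_STATE_RO, &fs->fs_state) &&
	       !model_test_bit(MODEL_FS_CLEANER_RUNNING, &fs->flags);
}
```

In this model the remount side sets the RO bit first and then polls model_remount_may_commit(); the cleaner consults model_need_cleaner_sleep() before starting new work.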
Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-14 10:10:47 +00:00
* We check for BTRFS_FS_STATE_RO to avoid races with a concurrent remount,
* since setting and checking for SB_RDONLY in the superblock's flags is not
* atomic.
2013-05-14 10:20:43 +00:00
*/
2016-06-22 18:54:24 -04:00
static inline int btrfs_need_cleaner_sleep(struct btrfs_fs_info *fs_info)
2013-05-14 10:20:43 +00:00
{
btrfs: fix race between RO remount and the cleaner task
When we are remounting a filesystem in RO mode we can race with the cleaner
task and end up leaking a transaction if the filesystem is unmounted
shortly after, before the transaction kthread had a chance to commit that
transaction. That also results in a crash during unmount, due to a
use-after-free, if hardware acceleration is not available for crc32c.
The following sequence of steps explains how the race happens.
1) The filesystem is mounted in RW mode and the cleaner task is running.
This means that currently BTRFS_FS_CLEANER_RUNNING is set at
fs_info->flags;
2) The cleaner task is currently running delayed iputs, for example;
3) A filesystem RO remount operation starts;
4) The RO remount task calls btrfs_commit_super(), which commits any
currently open transaction, and it finishes;
5) At this point the cleaner task is still running and it creates a new
transaction by doing one of the following things:
* When running the delayed iput() for an inode with a 0 link count,
in which case at btrfs_evict_inode() we start a transaction through
the call to evict_refill_and_join(), use it and then release its
handle through btrfs_end_transaction();
* When deleting a dead root through btrfs_clean_one_deleted_snapshot(),
a transaction is started at btrfs_drop_snapshot() and then its handle
is released through a call to btrfs_end_transaction_throttle();
* When the remount task was still running, and before the remount task
called btrfs_delete_unused_bgs(), the cleaner task also called
btrfs_delete_unused_bgs() and it picked and removed one block group
from the list of unused block groups. Before the cleaner task started
a transaction, through btrfs_start_trans_remove_block_group() at
btrfs_delete_unused_bgs(), the remount task had already called
btrfs_commit_super();
6) So at this point the filesystem is in RO mode and we have an open
transaction that was started by the cleaner task;
7) Shortly after, a filesystem unmount operation starts. At close_ctree()
we stop the transaction kthread before it had a chance to commit the
transaction, since less than 30 seconds (the default commit interval)
have elapsed since the last transaction was committed;
8) We end up calling iput() against the btree inode at close_ctree() while
there is an open transaction, and since that transaction was used to
update btrees by the cleaner, we have dirty pages in the btree inode
due to COW operations on metadata extents, and therefore writeback is
triggered for the btree inode.
So btree_write_cache_pages() is invoked to flush those dirty pages
during the final iput() on the btree inode. This results in creating a
bio and submitting it, which makes us end up at
btrfs_submit_metadata_bio();
9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
that calls btrfs_wq_submit_bio(), because check_async_write() returned
a value of 1. This value of 1 is because we did not have hardware
acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
set in fs_info->flags;
10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
workqueue at fs_info->workers, which was already freed before by the
call to btrfs_stop_all_workers() at close_ctree(). This results in an
invalid memory access due to a use-after-free, leading to a crash.
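The interleaving in the steps above can be sketched as a small, single-threaded model. This is only an illustration under simplifying assumptions: the struct and every function name below are hypothetical, the cleaner is reduced to "finish the pass already in progress", and the second ordering (wait for the cleaner, then commit) is shown purely for contrast.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, single-threaded model of the interleaving; none of
 * these names are kernel symbols. */
struct race_model {
	bool ro;              /* filesystem flagged read-only */
	bool cleaner_running; /* cleaner is in the middle of a pass */
	bool trans_open;      /* an uncommitted transaction exists */
};

/* Models btrfs_commit_super(): commits whatever transaction is open. */
static inline void model_commit_super(struct race_model *fs)
{
	fs->trans_open = false;
}

/* Models the cleaner finishing its current pass: it starts a
 * transaction for its work and releases the handle without
 * committing (step 5). */
static inline void model_cleaner_finish_pass(struct race_model *fs)
{
	if (!fs->cleaner_running)
		return;
	fs->trans_open = true;
	fs->cleaner_running = false;
}

/* Steps 3 to 6: commit while the cleaner is still running.
 * Returns true if a transaction is left open on an RO filesystem. */
static inline bool model_remount_racy(struct race_model *fs)
{
	fs->ro = true;
	model_commit_super(fs);          /* step 4 */
	model_cleaner_finish_pass(fs);   /* step 5 */
	return fs->ro && fs->trans_open; /* step 6 */
}

/* Contrasting order: let the cleaner finish before committing. */
static inline bool model_remount_wait_first(struct race_model *fs)
{
	fs->ro = true;
	model_cleaner_finish_pass(fs);   /* wait for the cleaner */
	assert(!fs->cleaner_running);
	model_commit_super(fs);
	return fs->ro && fs->trans_open;
}
```

Starting from a state with the cleaner running, model_remount_racy() reports a leaked transaction while model_remount_wait_first() does not, which is the property the unmount path in steps 7 to 10 depends on.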
When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:
------------[ cut here ]------------
WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
Code: f0 01 00 00 48 39 c2 75 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 48 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c6 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Code: 48 83 bb b0 03 00 00 00 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c7 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Code: ad de 49 be 22 01 00 (...)
RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c8 ]---
BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:
stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
Code: 54 55 53 48 89 f3 (...)
RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
submit_one_bio+0x61/0x70 [btrfs]
btree_write_cache_pages+0x414/0x450 [btrfs]
? kobject_put+0x9a/0x1d0
? trace_hardirqs_on+0x1b/0xf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
? free_debug_processing+0x1e1/0x2b0
do_writepages+0x43/0xe0
? lock_acquired+0x199/0x490
__writeback_single_inode+0x59/0x650
writeback_single_inode+0xaf/0x120
write_inode_now+0x94/0xd0
iput+0x187/0x2b0
close_ctree+0x2c6/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3cfebabee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
---[ end trace dd74718fef1ed5cc ]---
Finally, when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, a consequence of the transaction that was never committed and got
leaked:
=============================================================================
BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x0000000050cbdd61 @offset=12104
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x0000000086e9b0ff @offset=12776
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
commit_cowonly_roots+0x248/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000001a340018 @offset=4408
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_commit_transaction+0x60/0xc40 [btrfs]
create_subvol+0x56a/0x990 [btrfs]
btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
btrfs_ioctl+0x1a92/0x36f0 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x000000002b46292a @offset=13648
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? __mutex_unlock_slowpath+0x45/0x2a0
kmem_cache_destroy+0x55/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000004cf95ea8 @offset=6264
INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1
So fix this by making the remount path wait for the cleaner task before
calling btrfs_commit_super(). The remount path now waits for the bit
BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling
btrfs_commit_super() and this ensures the cleaner can not start a
transaction after that, because it sleeps when the filesystem is in RO
mode and we have already flagged the filesystem as RO before waiting for
BTRFS_FS_CLEANER_RUNNING to be cleared.
This also introduces a new flag BTRFS_FS_STATE_RO to be used for
fs_info->fs_state when the filesystem is in RO mode. This is because we
were doing the RO check using the flags of the superblock and setting the
RO mode simply by ORing into the superblock's flags - those operations are
not atomic and could result in the cleaner not seeing the update from the
remount task after it clears BTRFS_FS_CLEANER_RUNNING.
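The difference between a plain OR into a flags word and an atomic bit operation can be modelled outside the kernel. The sketch below is a userspace illustration only: it uses C11 `atomic_fetch_or()`/`atomic_fetch_and()` as stand-ins for the kernel's `set_bit()`/`clear_bit()`/`test_bit()`, and `fs_state_model`, `MODEL_STATE_RO` and the `model_*` helpers are hypothetical names, not btrfs API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit number, standing in for BTRFS_FS_STATE_RO. */
#define MODEL_STATE_RO 0

/* Userspace stand-in for fs_info->fs_state. */
struct fs_state_model {
	atomic_ulong fs_state;
};

/* Atomic read-modify-write: concurrent updates to other bits are not lost. */
static void model_set_ro(struct fs_state_model *fs)
{
	assert(fs != NULL);
	atomic_fetch_or(&fs->fs_state, 1UL << MODEL_STATE_RO);
}

static void model_clear_ro(struct fs_state_model *fs)
{
	assert(fs != NULL);
	atomic_fetch_and(&fs->fs_state, ~(1UL << MODEL_STATE_RO));
}

/* What a cleaner-like thread would check before starting new work. */
static bool model_is_ro(struct fs_state_model *fs)
{
	assert(fs != NULL);
	return (atomic_load(&fs->fs_state) >> MODEL_STATE_RO) & 1UL;
}
```

A remount-like caller would invoke `model_set_ro()` first and only then wait for the cleaner to finish, mirroring the ordering described above.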
Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-14 10:10:47 +00:00
return test_bit ( BTRFS_FS_STATE_RO , & fs_info -> fs_state ) ||
btrfs_fs_closing ( fs_info ) ;
}
static inline void btrfs_set_sb_rdonly ( struct super_block * sb )
{
sb -> s_flags |= SB_RDONLY ;
set_bit ( BTRFS_FS_STATE_RO , & btrfs_sb ( sb ) -> fs_state ) ;
}
static inline void btrfs_clear_sb_rdonly ( struct super_block * sb )
{
sb -> s_flags &= ~ SB_RDONLY ;
clear_bit ( BTRFS_FS_STATE_RO , & btrfs_sb ( sb ) -> fs_state ) ;
2013-05-14 10:20:43 +00:00
}
2012-06-21 11:08:04 +02:00
/* tree mod log functions from ctree.c */
u64 btrfs_get_tree_mod_seq ( struct btrfs_fs_info * fs_info ,
struct seq_list * elem ) ;
void btrfs_put_tree_mod_seq ( struct btrfs_fs_info * fs_info ,
struct seq_list * elem ) ;
2012-10-23 11:28:27 +02:00
int btrfs_old_root_level ( struct btrfs_root * root , u64 time_seq ) ;
2012-06-21 11:08:04 +02:00
2007-03-26 16:00:06 -04:00
/* root-item.c */
2018-08-01 11:32:29 +08:00
int btrfs_add_root_ref ( struct btrfs_trans_handle * trans , u64 root_id ,
u64 ref_id , u64 dirid , u64 sequence , const char * name ,
int name_len ) ;
2018-08-01 11:32:28 +08:00
int btrfs_del_root_ref ( struct btrfs_trans_handle * trans , u64 root_id ,
u64 ref_id , u64 dirid , u64 * sequence , const char * name ,
int name_len ) ;
2017-08-17 10:25:11 -04:00
int btrfs_del_root ( struct btrfs_trans_handle * trans ,
2018-08-01 11:32:27 +08:00
const struct btrfs_key * key ) ;
2017-01-17 23:24:37 -08:00
int btrfs_insert_root ( struct btrfs_trans_handle * trans , struct btrfs_root * root ,
const struct btrfs_key * key ,
struct btrfs_root_item * item ) ;
2011-10-03 23:22:44 -04:00
int __must_check btrfs_update_root ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_key * key ,
struct btrfs_root_item * item ) ;
2017-01-17 23:24:37 -08:00
int btrfs_find_root ( struct btrfs_root * root , const struct btrfs_key * search_key ,
2013-05-15 07:48:19 +00:00
struct btrfs_path * path , struct btrfs_root_item * root_item ,
struct btrfs_key * root_key ) ;
2016-06-21 21:16:51 -04:00
int btrfs_find_orphan_roots ( struct btrfs_fs_info * fs_info ) ;
2011-07-14 21:23:06 +00:00
void btrfs_set_root_node ( struct btrfs_root_item * item ,
struct extent_buffer * node ) ;
2011-03-28 02:01:25 +00:00
void btrfs_check_and_init_root_item ( struct btrfs_root_item * item ) ;
2012-07-25 17:35:53 +02:00
void btrfs_update_root_times ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ) ;
2011-03-28 02:01:25 +00:00
Btrfs: introduce a tree for items that map UUIDs to something
Mapping UUIDs to subvolume IDs is a high-effort operation
today. The algorithm even has quadratic effort (based on the
number of existing subvolumes), which means that it takes minutes
to send/receive a single subvolume if 10,000 subvolumes exist. But
even linear effort would be too much since it is a waste. And these
data structures to allow mapping UUIDs to subvolume IDs are created
every time a btrfs send/receive instance is started.
It is much more efficient to maintain a searchable persistent data
structure in the filesystem, one that is updated whenever a
subvolume/snapshot is created and deleted, and when the received
subvolume UUID is set by the btrfs-receive tool.
Therefore this commit adds kernel code that is able to
maintain data structures in the filesystem that allow quickly
searching for a given UUID and retrieving the data that is assigned to
this UUID, such as which subvolume ID is related to this UUID.
This commit adds a new tree to hold UUID-to-data mapping items. The
key of the items is the full UUID plus the key type BTRFS_UUID_KEY.
Multiple data blocks can be stored for a given UUID, a type/length/
value scheme is used.
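As a rough illustration of how "the full UUID plus the key type" can form a key: a 16-byte UUID fits exactly into the two 64-bit fields of a btrfs-style key. The sketch below is a userspace model under stated assumptions. `model_key`, `load_le64()` and `model_uuid_to_key()` are hypothetical names, the little-endian split of the UUID into `objectid` and `offset` is an assumption about the layout, and the type value passed in is a placeholder rather than the real BTRFS_UUID_KEY constant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory key: (objectid, type, offset). */
struct model_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Read 8 bytes as a little-endian 64-bit value, independent of host order. */
static uint64_t load_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* First half of the UUID goes to objectid, second half to offset (assumed). */
static void model_uuid_to_key(const uint8_t uuid[16], uint8_t type,
			      struct model_key *key)
{
	assert(uuid != NULL && key != NULL);
	key->type = type;
	key->objectid = load_le64(uuid);
	key->offset = load_le64(uuid + 8);
}
```

Because the whole UUID is encoded in the key, two different UUIDs never share a key of the same type, which is why no collision handling is needed in a dedicated tree.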
Now follows the lengthy justification, why a new tree was added
instead of using the existing root tree:
The first approach was to not create another tree that holds UUID
items. Instead, the items should just go into the top root tree.
Unfortunately this confused the algorithm to assign the objectid
of subvolumes and snapshots. The reason is that
btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for
the first created subvol or snapshot after mounting a filesystem,
and this function simply searches for the largest used objectid in
the root tree keys to pick the next objectid to assign. Of course,
the UUID keys have always been the ones with the highest offset
value, and the next assigned subvol ID was wastefully huge.
To use any other existing tree did not look proper. To apply a
workaround such as setting the objectid to zero in the UUID item
key and to implement collision handling would either add
limitations (in case of a btrfs_extend_item() approach to handle
the collisions) or a lot of complexity and source code (in case a
key would be looked up that is free of collisions). Adding new code
that introduces limitations is not good, and adding code that is
complex and lengthy for no good reason is also not good. That's the
justification why a completely new tree was introduced.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-15 17:11:17 +02:00
/* uuid-tree.c */
2018-05-29 15:01:53 +08:00
int btrfs_uuid_tree_add ( struct btrfs_trans_handle * trans , u8 * uuid , u8 type ,
2013-08-15 17:11:17 +02:00
u64 subid ) ;
2018-05-29 15:01:54 +08:00
int btrfs_uuid_tree_remove ( struct btrfs_trans_handle * trans , u8 * uuid , u8 type ,
2013-08-15 17:11:17 +02:00
u64 subid ) ;
2020-02-18 16:56:07 +02:00
int btrfs_uuid_tree_iterate ( struct btrfs_fs_info * fs_info ) ;
2013-08-15 17:11:17 +02:00
2007-03-26 16:00:06 -04:00
/* dir-item.c */
2012-12-17 14:26:57 -05:00
int btrfs_check_dir_item_collision ( struct btrfs_root * root , u64 dir ,
const char * name , int name_len ) ;
2018-08-04 21:10:57 +08:00
int btrfs_insert_dir_item ( struct btrfs_trans_handle * trans , const char * name ,
2017-02-20 13:50:31 +02:00
int name_len , struct btrfs_inode * dir ,
2008-07-24 12:12:38 -04:00
struct btrfs_key * location , u8 type , u64 index ) ;
2007-04-19 15:36:27 -04:00
struct btrfs_dir_item * btrfs_lookup_dir_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path , u64 dir ,
const char * name , int name_len ,
int mod ) ;
struct btrfs_dir_item *
btrfs_lookup_dir_index_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path , u64 dir ,
u64 objectid , const char * name , int name_len ,
int mod ) ;
2009-09-21 15:56:00 -04:00
struct btrfs_dir_item *
btrfs_search_dir_index_item ( struct btrfs_root * root ,
struct btrfs_path * path , u64 dirid ,
const char * name , int name_len ) ;
2007-04-19 15:36:27 -04:00
int btrfs_delete_one_dir_name ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path ,
struct btrfs_dir_item * di ) ;
2007-11-16 11:45:54 -05:00
int btrfs_insert_xattr_item ( struct btrfs_trans_handle * trans ,
2009-11-12 09:35:27 +00:00
struct btrfs_root * root ,
struct btrfs_path * path , u64 objectid ,
const char * name , u16 name_len ,
const void * data , u16 data_len ) ;
2007-11-16 11:45:54 -05:00
struct btrfs_dir_item * btrfs_lookup_xattr ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path , u64 dir ,
const char * name , u16 name_len ,
int mod ) ;
2016-06-22 18:54:24 -04:00
struct btrfs_dir_item * btrfs_match_dir_item_name ( struct btrfs_fs_info * fs_info ,
2014-11-09 08:38:39 +00:00
struct btrfs_path * path ,
const char * name ,
int name_len ) ;
2008-07-24 12:17:14 -04:00
/* orphan.c */
int btrfs_insert_orphan_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root , u64 offset ) ;
int btrfs_del_orphan_item ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root , u64 offset ) ;
2009-09-21 15:56:00 -04:00
int btrfs_find_orphan_item ( struct btrfs_root * root , u64 offset ) ;
2008-07-24 12:17:14 -04:00
2007-03-26 16:00:06 -04:00
/* inode-item.c */
2007-12-12 14:38:19 -05:00
int btrfs_insert_inode_ref ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
const char * name , int name_len ,
2008-07-24 12:12:38 -04:00
u64 inode_objectid , u64 ref_objectid , u64 index ) ;
2007-12-12 14:38:19 -05:00
int btrfs_del_inode_ref ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
const char * name , int name_len ,
2008-07-24 12:12:38 -04:00
u64 inode_objectid , u64 ref_objectid , u64 * index ) ;
2007-10-15 16:14:19 -04:00
int btrfs_insert_empty_inode ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path , u64 objectid ) ;
2007-03-20 15:57:25 -04:00
int btrfs_lookup_inode ( struct btrfs_trans_handle * trans , struct btrfs_root
2007-04-06 15:37:36 -04:00
* root , struct btrfs_path * path ,
struct btrfs_key * location , int mod ) ;
2007-03-26 16:00:06 -04:00
2012-08-08 11:32:27 -07:00
struct btrfs_inode_extref *
btrfs_lookup_inode_extref ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path ,
const char * name , int name_len ,
u64 inode_objectid , u64 ref_objectid , int ins_len ,
int cow ) ;
2019-08-27 14:46:28 +03:00
struct btrfs_inode_ref * btrfs_find_name_in_backref ( struct extent_buffer * leaf ,
int slot , const char * name ,
int name_len ) ;
2019-08-27 14:46:29 +03:00
struct btrfs_inode_extref * btrfs_find_name_in_ext_backref (
struct extent_buffer * leaf , int slot , u64 ref_objectid ,
const char * name , int name_len ) ;
2007-03-26 16:00:06 -04:00
/* file-item.c */
2013-07-25 19:22:34 +08:00
struct btrfs_dio_private ;
2008-12-10 09:10:46 -05:00
int btrfs_del_csums ( struct btrfs_trans_handle * trans ,
Btrfs: fix missing data checksums after replaying a log tree
When logging a file that has shared extents (reflinked with other files or
with itself), we can end up logging multiple checksum items that cover
overlapping ranges. This confuses the search for checksums at log replay
time causing some checksums to never be added to the fs/subvolume tree.
Consider the following example of a file that shares the same extent at
offsets 0 and 256Kb:
[ bytenr 13893632, offset 64Kb, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 13893632, offset 0, len 256Kb ]
256Kb 512Kb
When logging the inode, at tree-log.c:copy_items(), when processing the
file extent item at offset 0, we log a checksum item covering the range
13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
64Kb + 64Kb, respectively.
Later when processing the extent item at offset 256K, we log the checksums
for the range from 13893632 to 14155776 (which corresponds to 13893632 +
256Kb). These checksums get merged with the checksum item for the range
from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
So after this we get the two following checksum items in the log tree:
(...)
item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
range start 13631488 end 14155776 length 524288
item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
range start 13959168 end 14024704 length 65536
The first one covers the range of the second one, so they overlap.
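The range arithmetic in this example can be checked with a few lines of C. This is a userspace sketch only: `csum_range`, `make_range()` and `ranges_overlap()` are hypothetical helpers, and ranges are treated as half-open `[start, end)` intervals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KB 1024ULL

/* Half-open byte range [start, end) covered by one checksum item. */
struct csum_range {
	uint64_t start;
	uint64_t end;
};

/* Range of 'len' bytes starting 'offset' bytes into the extent at 'bytenr'. */
static struct csum_range make_range(uint64_t bytenr, uint64_t offset,
				    uint64_t len)
{
	struct csum_range r = { bytenr + offset, bytenr + offset + len };

	assert(r.start < r.end);
	return r;
}

/* Two half-open ranges overlap when each starts before the other ends. */
static bool ranges_overlap(struct csum_range a, struct csum_range b)
{
	return a.start < b.end && b.start < a.end;
}
```

With the numbers from the log tree dump, `make_range(13893632, 64 * KB, 64 * KB)` yields 13959168 to 14024704 (item 7), which lies entirely inside 13631488 to 14155776 (item 6).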
So far this does not cause a problem after replaying the log, because
when replaying the file extent item for offset 256K, we copy all the
checksums for the extent 13893632 from the log tree to the fs/subvolume
tree, since searching for a checksum item for bytenr 13893632 leaves us
at the first checksum item, which covers the whole range of the extent.
However if we write 64Kb to file offset 256Kb for example, we will
not be able to find and copy the checksums for the last 128Kb of the
extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.
After writing 64Kb into file offset 256Kb we get the following extent
layout for our file:
[ bytenr 13893632, offset 64Kb, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 14155776, offset 0, len 64Kb ]
256Kb 320Kb
[ bytenr 13893632, offset 64Kb, len 192Kb ]
320Kb 512Kb
After fsync'ing the file, if we have a power failure and then mount
the filesystem to replay the log, the following happens:
1) When replaying the file extent item for file offset 320Kb, we
look up the checksums for the extent range from 13959168
(13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
to btrfs_lookup_csums_range();
2) btrfs_lookup_csums_range() finds the checksum item that starts
precisely at offset 13959168 (item 7 in the log tree, shown before);
3) However that checksum item only covers 64Kb of data, and not 192Kb
of data;
4) As a result only the checksums for the first 64Kb of data referenced
by the file extent item are found and copied to the fs/subvolume tree.
The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
the corresponding data checksums found and copied to the fs/subvolume
tree.
5) After replaying the log, userspace will not be able to read the file
range from 384Kb to 512Kb, because the checksums are missing,
resulting in an -EIO error.
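The replay steps above can be modelled with a deliberately simplified lookup. This is not the real btrfs_lookup_csums_range() logic; it is a hypothetical userspace model (`csum_item` and `model_lookup_covered()` are made-up names) that assumes the search lands on the item with the largest key not greater than the start offset and then only walks forward.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct csum_item {
	uint64_t start; /* key offset: first byte covered */
	uint64_t end;   /* first byte past the covered range */
};

/*
 * Simplified model: land on the item with the largest key <= start, then
 * walk forward only. Returns how many bytes of [start, end) are covered
 * contiguously, beginning at start. Items must be sorted by 'start'.
 */
static uint64_t model_lookup_covered(const struct csum_item *items, size_t n,
				     uint64_t start, uint64_t end)
{
	size_t slot = 0;
	uint64_t cur = start;

	assert(items != NULL && n > 0 && start < end);

	/* Find the last item whose key is <= start. */
	for (size_t i = 0; i < n; i++)
		if (items[i].start <= start)
			slot = i;

	/* Items before 'slot' are never revisited, even if they cover more. */
	for (size_t i = slot; i < n && cur < end; i++) {
		if (items[i].start > cur || items[i].end <= cur)
			break;
		cur = items[i].end < end ? items[i].end : end;
	}
	return cur - start;
}
```

In this model, a lookup that starts exactly at 13959168 lands on the small item and finds only 64Kb, while a lookup that starts at 13893632 lands on the large item and finds the full 256Kb.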
The following steps reproduce this scenario:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar
$ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
<power failure>
$ mount /dev/sdc /mnt/sdc
$ md5sum /mnt/sdc/foobar
md5sum: /mnt/sdc/foobar: Input/output error
$ dmesg | tail
[165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
[165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
[165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
[165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
[165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
[165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
[165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
[165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
[165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
[165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
Fix this simply by deleting first any checksums, from the log tree, for the
range of the extent we are logging at copy_items(). This ensures we do not
get checksum items in the log tree that have overlapping ranges.
This is a long-standing issue that has been present since we have had the clone
(and deduplication) ioctl, and can happen both when an extent is shared
between different files and within the same file.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-05 16:58:30 +00:00
struct btrfs_root * root , u64 bytenr , u64 len ) ;
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
    for (cur_disk_bytenr = orig_disk_bytenr;
         cur_disk_bytenr < orig_disk_bytenr + orig_len;
         cur_disk_bytenr += count * sectorsize) {
        /* Lookup csum tree */
        count = search_csum_tree(fs_info, path, cur_disk_bytenr,
                                 search_len, csum_dst);
        if (!count) {
            /* Csum hole handling */
        }
    }
- Use a single variable as the source to calculate all other offsets
Instead of many different variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variables will only be visible in the inner loop.
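The loop shape described in this list can be exercised in userspace with a mock checksum source. Everything below is a sketch under assumptions: `model_csum_tree`, `model_search_csum_tree()` and `model_lookup_sums()` are hypothetical stand-ins, the sector and checksum sizes are fixed at 4096 and 4 bytes, and the hole handling (zero the slot and advance by one sector) is an assumed behaviour chosen so the loop always makes progress.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTORSIZE 4096u
#define CSUM_SIZE 4u

/* Mock "csum tree": one contiguous run of sectors that have checksums. */
struct model_csum_tree {
	uint64_t start;      /* disk bytenr of the first covered sector */
	uint32_t nr_sectors; /* number of covered sectors */
	uint8_t fill;        /* byte used as fake checksum content */
};

/* Returns the number of sectors found starting at disk_bytenr (0 = hole). */
static uint32_t model_search_csum_tree(const struct model_csum_tree *tree,
				       uint64_t disk_bytenr, uint64_t len,
				       uint8_t *dst)
{
	uint64_t tree_end = tree->start + (uint64_t)tree->nr_sectors * SECTORSIZE;
	uint64_t end = disk_bytenr + len;
	uint32_t count;

	if (disk_bytenr < tree->start || disk_bytenr >= tree_end)
		return 0;
	if (end > tree_end)
		end = tree_end;
	count = (uint32_t)((end - disk_bytenr) / SECTORSIZE);
	memset(dst, tree->fill, (size_t)count * CSUM_SIZE);
	return count;
}

/* Fills dst with one checksum per sector; returns the number of hole sectors. */
static uint32_t model_lookup_sums(const struct model_csum_tree *tree,
				  uint64_t orig_disk_bytenr, uint64_t orig_len,
				  uint8_t *dst)
{
	uint32_t holes = 0;
	uint32_t count = 0;

	assert(orig_len % SECTORSIZE == 0);
	for (uint64_t cur = orig_disk_bytenr;
	     cur < orig_disk_bytenr + orig_len;
	     cur += (uint64_t)count * SECTORSIZE) {
		/* Every offset below is derived from the single variable 'cur'. */
		uint64_t search_len = orig_disk_bytenr + orig_len - cur;
		uint8_t *csum_dst = dst +
			(cur - orig_disk_bytenr) / SECTORSIZE * CSUM_SIZE;

		count = model_search_csum_tree(tree, cur, search_len, csum_dst);
		if (!count) {
			/* Assumed hole handling: zero the slot, step one sector. */
			memset(csum_dst, 0, CSUM_SIZE);
			count = 1;
			holes++;
		}
	}
	return holes;
}
```

Note that no file offset appears anywhere in the model: the on-disk start and length are enough to drive the whole loop.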
The above refactoring makes btrfs_lookup_bio_sums() far more robust than
it used to be, especially with regard to the file offset lookup. Now
the file_offset lookup is only needed for the data reloc inode; otherwise we
don't need to bother with file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 14:48:06 +08:00
blk_status_t btrfs_lookup_bio_sums ( struct inode * inode , struct bio * bio , u8 * dst ) ;
2007-04-17 13:26:50 -04:00
int btrfs_insert_file_extent ( struct btrfs_trans_handle * trans ,
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 14:49:59 -04:00
struct btrfs_root * root ,
u64 objectid , u64 pos ,
u64 disk_offset , u64 disk_num_bytes ,
u64 num_bytes , u64 offset , u64 ram_bytes ,
u8 compression , u8 encryption , u16 other_encoding ) ;
2007-03-26 16:00:06 -04:00
int btrfs_lookup_file_extent ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
struct btrfs_path * path , u64 objectid ,
2007-10-15 16:15:53 -04:00
u64 bytenr , int mod ) ;
2008-02-20 12:07:25 -05:00
int btrfs_csum_file_blocks ( struct btrfs_trans_handle * trans ,
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
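The keying scheme described above can be sketched as follows. This is a minimal illustration only: the struct, the helper names and the two constants are hypothetical stand-ins (the real magic objectid and item type are defined in ctree.h), and the comparison order (objectid, then type, then offset) is assumed to be the usual btree key order.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder values for illustration; the real ones live in ctree.h. */
#define SKETCH_CSUM_OBJECTID	((uint64_t)-10)
#define SKETCH_CSUM_KEY_TYPE	128

/* Hypothetical in-memory key: (objectid, type, offset). */
struct sketch_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Build the key for the checksum item covering an extent on disk. */
static struct sketch_key sketch_csum_key(uint64_t extent_start)
{
	struct sketch_key key = {
		SKETCH_CSUM_OBJECTID,	/* fixed for every checksum item */
		SKETCH_CSUM_KEY_TYPE,
		extent_start,		/* starting byte of the extent */
	};
	return key;
}

/* Compare two keys field by field: objectid, then type, then offset. */
static int sketch_key_cmp(const struct sketch_key *a,
			  const struct sketch_key *b)
{
	assert(a && b);
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}
```

Because the objectid is fixed and no inode number appears in the key, two checksum keys differ only by extent start, so the same item can be copied into another tree unchanged.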
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-08 16:58:54 -05:00
struct btrfs_root * root ,
2008-07-17 12:53:50 -04:00
struct btrfs_ordered_sum * sums ) ;
2020-06-03 08:55:07 +03:00
blk_status_t btrfs_csum_one_bio ( struct btrfs_inode * inode , struct bio * bio ,
u64 file_start , int contig ) ;
2011-03-08 14:14:00 +01:00
int btrfs_lookup_csums_range ( struct btrfs_root * root , u64 start , u64 end ,
struct list_head * list , int search_commit ) ;
2017-02-20 13:51:02 +02:00
void btrfs_extent_item_to_extent_map ( struct btrfs_inode * inode ,
2014-06-09 03:48:05 +01:00
const struct btrfs_path * path ,
struct btrfs_file_extent_item * fi ,
const bool new_inline ,
struct extent_map * em ) ;
2020-01-17 09:02:21 -05:00
int btrfs_inode_clear_file_extent_range ( struct btrfs_inode * inode , u64 start ,
u64 len ) ;
int btrfs_inode_set_file_extent_range ( struct btrfs_inode * inode , u64 start ,
u64 len ) ;
2020-11-02 16:48:53 +02:00
void btrfs_inode_safe_disk_i_size_write ( struct btrfs_inode * inode , u64 new_i_size ) ;
2020-03-09 12:41:06 +00:00
u64 btrfs_file_extent_end ( const struct btrfs_path * path ) ;
2014-06-09 03:48:05 +01:00
2007-06-12 06:35:45 -04:00
/* inode.c */
2020-09-18 16:34:37 +03:00
blk_status_t btrfs_submit_data_bio ( struct inode * inode , struct bio * bio ,
int mirror_num , unsigned long bio_flags ) ;
2020-12-02 14:47:58 +08:00
int btrfs_verify_data_csum ( struct btrfs_io_bio * io_bio , u32 bio_offset ,
2020-09-18 16:34:33 +03:00
struct page * page , u64 start , u64 end , int mirror ) ;
2017-02-20 13:51:06 +02:00
struct extent_map * btrfs_get_extent_fiemap ( struct btrfs_inode * inode ,
2018-12-12 09:42:32 +02:00
u64 start , u64 len ) ;
2013-08-14 14:02:47 -04:00
noinline int can_nocow_extent ( struct inode * inode , u64 offset , u64 * len ,
2013-06-21 16:37:03 -04:00
u64 * orig_start , u64 * orig_block_len ,
2020-08-18 11:00:05 -07:00
u64 * ram_bytes , bool strict ) ;
2008-07-24 09:51:08 -04:00
2018-04-27 12:21:51 +03:00
void __btrfs_del_delalloc_inode ( struct btrfs_root * root ,
struct btrfs_inode * inode ) ;
2008-11-17 21:02:50 -05:00
struct inode * btrfs_lookup_dentry ( struct inode * dir , struct dentry * dentry ) ;
2017-02-20 13:50:35 +02:00
int btrfs_set_inode_index ( struct btrfs_inode * dir , u64 * index ) ;
2008-09-05 16:13:11 -04:00
int btrfs_unlink_inode ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
2017-01-18 00:31:44 +02:00
struct btrfs_inode * dir , struct btrfs_inode * inode ,
2008-09-05 16:13:11 -04:00
const char * name , int name_len ) ;
int btrfs_add_link ( struct btrfs_trans_handle * trans ,
2017-02-20 13:51:08 +02:00
struct btrfs_inode * parent_inode , struct btrfs_inode * inode ,
2008-09-05 16:13:11 -04:00
const char * name , int name_len , int add_backref , u64 index ) ;
2018-04-18 11:34:52 +09:00
int btrfs_delete_subvolume ( struct inode * dir , struct dentry * dentry ) ;
2020-11-02 16:49:03 +02:00
int btrfs_truncate_block ( struct btrfs_inode * inode , loff_t from , loff_t len ,
int front ) ;
2008-09-05 16:13:11 -04:00
int btrfs_truncate_inode_items ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ,
2020-11-02 16:48:55 +02:00
struct btrfs_inode * inode , u64 new_size ,
2008-09-05 16:13:11 -04:00
u32 min_type ) ;
2018-11-01 14:49:03 +08:00
int btrfs_start_delalloc_snapshot ( struct btrfs_root * root ) ;
2021-01-11 12:58:11 +02:00
int btrfs_start_delalloc_roots ( struct btrfs_fs_info * fs_info , long nr ,
btrfs: fix deadlock when cloning inline extent and low on free metadata space
When cloning an inline extent there are cases where we can not just copy
the inline extent from the source range to the target range (e.g. when the
target range starts at an offset greater than zero). In such cases we copy
the inline extent's data into a page of the destination inode and then
dirty that page. However, after that we will need to start a transaction
for each processed extent and, if we are ever low on available metadata
space, we may need to flush existing delalloc for all dirty inodes in an
attempt to release metadata space - if that happens we may deadlock:
* the async reclaim task queued a delalloc work to flush delalloc for
the destination inode of the clone operation;
* the task executing that delalloc work gets blocked waiting for the
range with the dirty page to be unlocked, which is currently locked
by the task doing the clone operation;
* the async reclaim task blocks waiting for the delalloc work to complete;
* the cloning task is waiting on the waitqueue of its reservation ticket
while holding the range with the dirty page locked in the inode's
io_tree;
* if metadata space is not released by some other task (like delalloc for
some other inode completing for example), the clone task waits forever
and as a consequence the delalloc work and async reclaim tasks will hang
forever as well. Releasing more space on the other hand may require
starting a transaction, which will hang as well when trying to reserve
metadata space, resulting in a deadlock between all these tasks.
When this happens, traces like the following show up in dmesg/syslog:
[87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
[87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
[87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[87452.326136] Call Trace:
[87452.326737] __schedule+0x5d1/0xcf0
[87452.327390] schedule+0x45/0xe0
[87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs]
[87452.328894] ? finish_wait+0x90/0x90
[87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs]
[87452.330133] ? __mod_memcg_state+0x8e/0x160
[87452.330738] __extent_writepage+0x2d4/0x400 [btrfs]
[87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs]
[87452.332007] ? lock_release+0x20e/0x4c0
[87452.332557] ? trace_hardirqs_on+0x1b/0xf0
[87452.333127] extent_writepages+0x43/0x90 [btrfs]
[87452.333653] ? lock_acquire+0x1a3/0x490
[87452.334177] do_writepages+0x43/0xe0
[87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100
[87452.335720] __filemap_fdatawrite_range+0xc5/0x100
[87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
[87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs]
[87452.337838] process_one_work+0x24e/0x5e0
[87452.338437] worker_thread+0x50/0x3b0
[87452.339137] ? process_one_work+0x5e0/0x5e0
[87452.339884] kthread+0x153/0x170
[87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.341153] ret_from_fork+0x22/0x30
[87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
[87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
[87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[87452.345655] Call Trace:
[87452.346305] __schedule+0x5d1/0xcf0
[87452.346947] ? kvm_clock_read+0x14/0x30
[87452.347676] ? wait_for_completion+0x81/0x110
[87452.348389] schedule+0x45/0xe0
[87452.349077] schedule_timeout+0x30c/0x580
[87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[87452.350340] ? lock_acquire+0x1a3/0x490
[87452.351006] ? try_to_wake_up+0x7a/0xa20
[87452.351541] ? lock_release+0x20e/0x4c0
[87452.352040] ? lock_acquired+0x199/0x490
[87452.352517] ? wait_for_completion+0x81/0x110
[87452.353000] wait_for_completion+0xab/0x110
[87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs]
[87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
[87452.354455] flush_space+0x24f/0x660 [btrfs]
[87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
[87452.355565] process_one_work+0x24e/0x5e0
[87452.356024] worker_thread+0x20f/0x3b0
[87452.356487] ? process_one_work+0x5e0/0x5e0
[87452.356973] kthread+0x153/0x170
[87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.357880] ret_from_fork+0x22/0x30
(...)
< stack traces of several tasks waiting for the locks of the inodes of the
clone operation >
(...)
[92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
[92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
[92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
[92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
[92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
[92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000
[92867.447920] Call Trace:
[92867.448435] __schedule+0x5d1/0xcf0
[92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[92867.449423] schedule+0x45/0xe0
[92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs]
[92867.450576] ? finish_wait+0x90/0x90
[92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
[92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs]
[92867.452412] start_transaction+0x2d1/0x760 [btrfs]
[92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs]
[92867.453848] ? lock_release+0x20e/0x4c0
[92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
[92867.455218] btrfs_clone+0x569/0x7e0 [btrfs]
[92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs]
[92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs]
[92867.457213] do_clone_file_range+0xd4/0x1f0
[92867.457828] vfs_clone_file_range+0x4d/0x230
[92867.458355] ? lock_release+0x20e/0x4c0
[92867.458890] ioctl_file_clone+0x8f/0xc0
[92867.459377] do_vfs_ioctl+0x342/0x750
[92867.459913] __x64_sys_ioctl+0x62/0xb0
[92867.460377] do_syscall_64+0x33/0x80
[92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9
(...)
< stack traces of more tasks blocked on metadata reservation like the clone
task above, because the async reclaim task has deadlocked >
(...)
Another thing to notice is that the worker task that is deadlocked when
trying to flush the destination inode of the clone operation is at
btrfs_invalidatepage(). This is simply because the clone operation has a
destination offset greater than the i_size and we only update the i_size
of the destination file after cloning an extent (just like we do in the
buffered write path).
Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
the flushing of delalloc for all inodes that have delalloc, add a runtime
flag to an inode to signal it should not be flushed, and for inodes with
that flag set, start_delalloc_inodes() will simply skip them. When the
cloning code needs to dirty a page to copy an inline extent, set that flag
on the inode and then clear it when the clone operation finishes.
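The skip logic described above can be sketched in userspace C as follows. The flag name, the struct and the helper are all hypothetical; the sketch only models the decision made per inode, not the actual work queueing done by start_delalloc_inodes().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical runtime flag: "do not flush delalloc for this inode". */
#define SKETCH_INODE_NO_DELALLOC_FLUSH	(1UL << 0)

struct sketch_inode {
	unsigned long runtime_flags;
	bool has_delalloc;	/* inode has dirty, not yet flushed data */
	bool flushed;		/* set once a flush was started for it */
};

/*
 * Walk all inodes and start a flush for each one that has delalloc,
 * except those marked with the no-flush flag. Returns how many flushes
 * were started.
 */
static size_t sketch_start_delalloc(struct sketch_inode *inodes, size_t n)
{
	size_t started = 0;

	assert(inodes || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct sketch_inode *inode = &inodes[i];

		if (!inode->has_delalloc)
			continue;
		/* The cloning task holds this range locked: skip it. */
		if (inode->runtime_flags & SKETCH_INODE_NO_DELALLOC_FLUSH)
			continue;
		inode->has_delalloc = false;
		inode->flushed = true;
		started++;
	}
	return started;
}
```

In this model the clone path would set the flag before dirtying the page and clear it when the clone finishes, after which a later walk picks the inode up normally.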
This could be sporadically triggered with test case generic/269 from
fstests, which exercises many fsstress processes running in parallel with
several dd processes filling up the entire filesystem.
CC: stable@vger.kernel.org # 5.9+
Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 11:55:58 +00:00
bool in_reclaim_context ) ;
2020-06-03 08:55:35 +03:00
int btrfs_set_extent_delalloc ( struct btrfs_inode * inode , u64 start , u64 end ,
2017-11-04 00:16:59 +00:00
unsigned int extra_bits ,
2019-07-17 16:18:17 +03:00
struct extent_state * * cached_state ) ;
2008-12-11 16:30:39 -05:00
int btrfs_create_subvol_root ( struct btrfs_trans_handle * trans ,
Btrfs: add support for inode properties
This change adds infrastructure to allow for generic properties for
inodes. Properties are name/value pairs that can be associated with
inodes for different purposes. They are stored as xattrs with the
prefix "btrfs."
Properties can be inherited - this means when a directory inode has
inheritable properties set, these are added to new inodes created
under that directory. Further, subvolumes can also have properties
associated with them, and they can be inherited from their parent
subvolume. Naturally, directory properties have priority over subvolume
properties (in practice a subvolume property is just a regular
property associated with the root inode, objectid 256, of the
subvolume's fs tree).
This change also adds one specific property implementation, named
"compression", whose values can be "lzo" or "zlib" and it's an
inheritable property.
The corresponding changes to btrfs-progs were also implemented.
A patch with xfstests for this feature will follow once there's
agreement on this change/feature.
Further, the script at the bottom of this commit message was used to
do some benchmarks to measure any performance penalties of this feature.
Basically the tests correspond to:
Test 1 - create a filesystem and mount it with compress-force=lzo,
then sequentially create N files of 64Kb each, measure how long it took
to create the files, unmount the filesystem, mount the filesystem and
perform an 'ls -lha' against the test directory holding the N files, and
report the time the command took.
Test 2 - create a filesystem and don't use any compression option when
mounting it - instead set the compression property of the subvolume's
root to 'lzo'. Then create N files of 64Kb, and report the time it took.
Then unmount the filesystem, mount it again and perform an 'ls -lha' like
in the former test. This means every single file ends up with a property
(xattr) associated to it.
Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
compression property, have no real effect other than adding more work
when inheriting properties and taking more btree leaf space.
Test 4 - same as test 3 but with 10 properties per file.
Results (in seconds, and averages of 5 runs each), for different N
numbers of files follow.
* Without properties (test 1)
                      file creation time    ls -lha time
10 000 files                        3.49            0.76
100 000 files                      47.19            8.37
1 000 000 files                   518.51          107.06
* With 1 property (compression property set to lzo - test 2)
                      file creation time    ls -lha time
10 000 files                        3.63            0.93
100 000 files                      48.56            9.74
1 000 000 files                   537.72          125.11
* With 4 properties (test 3)
                      file creation time    ls -lha time
10 000 files                        3.94            1.20
100 000 files                      52.14           11.48
1 000 000 files                   572.70          142.13
* With 10 properties (test 4)
                      file creation time    ls -lha time
10 000 files                        4.61            1.35
100 000 files                      58.86           13.83
1 000 000 files                   656.01          177.61
The increased latencies with properties are essentially because of:
*) When creating an inode, we now synchronously write 1 more item
(an xattr item) for each property inherited from the parent dir
(or subvolume). This could be done in an asynchronous way such
as we do for dir index items (delayed-inode.c), which could help
reduce the file creation latency;
*) With properties, we now have larger fs trees. For this particular
test each xattr item uses 75 bytes of leaf space in the fs tree.
This could be less by using a new item for xattr items, instead of
the current btrfs_dir_item, since we could cut the 'location' and
'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
total of 26 bytes per xattr item) from the btrfs_dir_item type.
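The byte counts above can be checked with a small sketch. The struct layouts below are assumptions reconstructed for illustration (a 17-byte disk key, a 25-byte item header and a 30-byte dir item, all packed); they are not taken from this message, and the breakdown of the 75 bytes is a reconstruction rather than something the message states. `__attribute__((packed))` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed on-disk layouts, packed so sizeof() matches the byte counts. */
struct sketch_disk_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
} __attribute__((packed));

/* Assumed per-item header stored in the leaf. */
struct sketch_item_header {
	struct sketch_disk_key key;
	uint32_t offset;
	uint32_t size;
} __attribute__((packed));

/* Assumed layout of the dir item that also backs xattrs. */
struct sketch_dir_item {
	struct sketch_disk_key location;
	uint64_t transid;
	uint16_t data_len;
	uint16_t name_len;
	uint8_t type;
} __attribute__((packed));

/* Leaf space used by one xattr: header + dir item + name + value. */
static size_t sketch_xattr_leaf_bytes(const char *name, const char *value)
{
	assert(name && value);
	return sizeof(struct sketch_item_header) +
	       sizeof(struct sketch_dir_item) +
	       strlen(name) + strlen(value);
}

/* Bytes saved by dropping the 'location' and 'type' fields. */
static size_t sketch_saving_location_and_type(void)
{
	return sizeof(struct sketch_disk_key) + sizeof(uint8_t);
}

/* Bytes saved when 'transid' is dropped as well. */
static size_t sketch_saving_with_transid(void)
{
	return sketch_saving_location_and_type() + sizeof(uint64_t);
}
```

Under these assumptions, `sketch_xattr_leaf_bytes("btrfs.compression", "lzo")` gives 25 + 30 + 17 + 3 = 75 bytes, and the two saving helpers give 18 and 26 bytes, matching the figures quoted in the message.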
Also tried batching the xattr insertions (ignoring proper hash
collision handling, since it didn't exist) when creating files that
inherit properties from their parent inode/subvolume, but the end
results were (surprisingly) essentially the same.
Test script:
$ cat test.pl
#!/usr/bin/perl -w
use strict;
use Time::HiRes qw(time);
use constant NUM_FILES => 10_000;
use constant FILE_SIZES => (64 * 1024);
use constant DEV => '/dev/sdb4';
use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
use constant TEST_DIR => (MNT_POINT . '/testdir');
system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";
# following line for testing without properties
#system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";
# following 2 lines for testing with properties
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";
system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
my ($t1, $t2);
$t1 = time();
for (my $i = 1; $i <= NUM_FILES; $i++) {
my $p = TEST_DIR . '/file_' . $i;
open(my $f, '>', $p) or die "Error opening file!";
$f->autoflush(1);
for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
print $f ('A' x 4096) or die "Error writing to file!";
}
close($f);
}
$t2 = time();
print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
$t1 = time();
system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
$t2 = time();
print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-07 11:47:46 +00:00
struct btrfs_root * new_root ,
2020-12-07 17:32:37 +02:00
struct btrfs_root * parent_root ) ;
2018-11-08 10:18:08 +02:00
void btrfs_set_delalloc_extent ( struct inode * inode , struct extent_state * state ,
2018-11-01 14:09:50 +02:00
unsigned * bits ) ;
2018-11-01 14:09:51 +02:00
void btrfs_clear_delalloc_extent ( struct inode * inode ,
struct extent_state * state , unsigned * bits ) ;
2018-11-01 14:09:52 +02:00
void btrfs_merge_delalloc_extent ( struct inode * inode , struct extent_state * new ,
struct extent_state * other ) ;
2018-11-01 14:09:53 +02:00
void btrfs_split_delalloc_extent ( struct inode * inode ,
struct extent_state * orig , u64 split ) ;
2018-11-27 20:57:58 +02:00
int btrfs_bio_fits_in_stripe ( struct page * page , size_t size , struct bio * bio ,
unsigned long bio_flags ) ;
2021-02-04 19:22:01 +09:00
bool btrfs_bio_fits_in_ordered_extent ( struct page * page , struct bio * bio ,
unsigned int size ) ;
2018-07-18 20:32:52 +02:00
void btrfs_set_range_writeback ( struct extent_io_tree * tree , u64 start , u64 end ) ;
2018-06-06 19:54:44 +05:30
vm_fault_t btrfs_page_mkwrite ( struct vm_fault * vmf ) ;
2007-06-15 13:50:00 -04:00
int btrfs_readpage ( struct file * file , struct page * page ) ;
2010-06-07 11:35:40 -04:00
void btrfs_evict_inode ( struct inode * inode ) ;
2010-03-05 09:21:37 +01:00
int btrfs_write_inode ( struct inode * inode , struct writeback_control * wbc ) ;
2007-06-12 06:35:45 -04:00
struct inode * btrfs_alloc_inode ( struct super_block * sb ) ;
void btrfs_destroy_inode ( struct inode * inode ) ;
2019-04-10 15:14:41 -04:00
void btrfs_free_inode ( struct inode * inode ) ;
2010-06-07 13:43:19 -04:00
int btrfs_drop_inode ( struct inode * inode ) ;
2017-11-02 17:21:50 -06:00
int __init btrfs_init_cachep ( void ) ;
2018-02-19 17:24:18 +01:00
void __cold btrfs_destroy_cachep ( void ) ;
2020-05-15 19:35:59 +02:00
struct inode * btrfs_iget_path ( struct super_block * s , u64 ino ,
2019-10-03 19:09:35 +02:00
struct btrfs_root * root , struct btrfs_path * path ) ;
2020-05-15 19:35:59 +02:00
struct inode * btrfs_iget ( struct super_block * s , u64 ino , struct btrfs_root * root ) ;
2017-02-20 13:51:06 +02:00
struct extent_map * btrfs_get_extent ( struct btrfs_inode * inode ,
2018-08-25 13:47:59 +08:00
struct page * page , size_t pg_offset ,
2019-12-02 17:34:23 -08:00
u64 start , u64 end ) ;
2007-08-27 16:49:44 -04:00
int btrfs_update_inode ( struct btrfs_trans_handle * trans ,
2020-11-02 16:48:59 +02:00
struct btrfs_root * root , struct btrfs_inode * inode ) ;
2012-10-22 15:43:12 -04:00
int btrfs_update_inode_fallback ( struct btrfs_trans_handle * trans ,
2020-11-02 16:49:06 +02:00
struct btrfs_root * root , struct btrfs_inode * inode ) ;
2017-02-20 13:50:59 +02:00
int btrfs_orphan_add ( struct btrfs_trans_handle * trans ,
struct btrfs_inode * inode ) ;
2011-01-31 16:22:42 -05:00
int btrfs_orphan_cleanup ( struct btrfs_root * root ) ;
2020-11-02 16:49:04 +02:00
int btrfs_cont_expand ( struct btrfs_inode * inode , loff_t oldsize , loff_t size ) ;
2009-11-12 09:36:34 +00:00
void btrfs_add_delayed_iput ( struct inode * inode ) ;
2016-06-22 18:54:24 -04:00
void btrfs_run_delayed_iputs ( struct btrfs_fs_info * fs_info ) ;
2018-12-03 11:06:52 -05:00
int btrfs_wait_on_delayed_iputs ( struct btrfs_fs_info * fs_info ) ;
2010-05-16 10:49:59 -04:00
int btrfs_prealloc_file_range ( struct inode * inode , int mode ,
u64 start , u64 num_bytes , u64 min_size ,
loff_t actual_len , u64 * alloc_hint ) ;
2010-06-21 14:48:16 -04:00
int btrfs_prealloc_file_range_trans ( struct inode * inode ,
struct btrfs_trans_handle * trans , int mode ,
u64 start , u64 num_bytes , u64 min_size ,
loff_t actual_len , u64 * alloc_hint ) ;
2020-06-03 08:55:29 +03:00
int btrfs_run_delalloc_range ( struct btrfs_inode * inode , struct page * locked_page ,
2018-11-01 14:09:46 +02:00
u64 start , u64 end , int * page_started , unsigned long * nr_written ,
struct writeback_control * wbc ) ;
2018-11-01 14:09:47 +02:00
int btrfs_writepage_cow_fixup ( struct page * page , u64 start , u64 end ) ;
2018-11-01 14:09:48 +02:00
void btrfs_writepage_endio_finish_ordered ( struct page * page , u64 start ,
2018-11-08 10:18:08 +02:00
u64 end , int uptodate ) ;
2009-10-09 09:54:36 -04:00
extern const struct dentry_operations btrfs_dentry_operations ;
2020-09-24 11:39:12 -05:00
extern const struct iomap_ops btrfs_dio_iomap_ops ;
extern const struct iomap_dio_ops btrfs_dio_ops ;
2008-06-11 21:53:53 -04:00
2020-09-24 11:39:16 -05:00
/* Inode locking type flags, by default the exclusive lock is taken */
# define BTRFS_ILOCK_SHARED (1U << 0)
# define BTRFS_ILOCK_TRY (1U << 1)
int btrfs_inode_lock ( struct inode * inode , unsigned int ilock_flags ) ;
void btrfs_inode_unlock ( struct inode * inode , unsigned int ilock_flags ) ;
btrfs: update the number of bytes used by an inode atomically
There are several occasions where we do not update the inode's number of
used bytes atomically, resulting in a concurrent stat(2) syscall to report
a value of used blocks that does not correspond to a valid value, that is,
a value that does not match neither what we had before the operation nor
what we get after the operation completes.
In extreme cases it can result in stat(2) reporting zero used blocks, which
can cause problems for some userspace tools where they can consider a file
with a non-zero size and zero used blocks as completely sparse and skip
reading data, as reported/discussed a long time ago in some threads like
the following:
https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
The cases where this can happen are the following:
-> Case 1
If we do a write (buffered or direct IO) against a file region for which
there is already an allocated extent (or multiple extents), then we have a
short time window where we can report a number of used blocks to stat(2)
that does not take into account the file region being overwritten. This
short time window happens when completing the ordered extent(s).
This happens because when we drop the extents in the write range we
decrement the inode's number of bytes and later on when we insert the new
extent(s) we increment the number of bytes in the inode, resulting in a
short time window where a stat(2) syscall can get an incorrect number of
used blocks.
If we do writes that overwrite an entire file, then we have a short time
window where we report 0 used blocks to stat(2).
Example reproducer:
$ cat reproducer-1.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got
while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
expected=$(stat -c %b $MNT/foobar)
# Create a process to keep calling stat(2) on the file and see if the
# reported number of blocks used (disk space used) changes, it should
# not because we are not increasing the file size nor punching holes.
stat_loop $MNT/foobar $expected &
loop_pid=$!
for ((i = 0; i < 50000; i++)); do
xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
done
kill $loop_pid &> /dev/null
wait
umount $DEV
$ ./reproducer-1.sh
ERROR: unexpected used blocks (got: 0 expected: 128)
ERROR: unexpected used blocks (got: 0 expected: 128)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 2
If we do a buffered write against a file region that does not have any
allocated extents, like a hole or beyond EOF, then during ordered extent
completion we have a short time window where a concurrent stat(2) syscall
can report a number of used blocks that does not correspond to the value
before or after the write operation, a value that is actually larger than
the value after the write completes.
This happens because once we start a buffered write into an unallocated
file range we increment the inode's 'new_delalloc_bytes', to make sure
any stat(2) call gets a correct used blocks value before delalloc is
flushed and completes. However at ordered extent completion, after we
inserted the new extent, we increment the inode's number of bytes used
with the size of the new extent, and only later, when clearing the range
in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
counter with the size of the extent. So this results in a short time
window where a concurrent stat(2) syscall can report a number of used
blocks that accounts for the new extent twice.
Example reproducer:
$ cat reproducer-2.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got
while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
touch $MNT/foobar
write_size=$((64 * 1024))
for ((i = 0; i < 16384; i++)); do
offset=$(($i * $write_size))
xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
blocks_used=$(stat -c %b $MNT/foobar)
# Fsync the file to trigger writeback and keep calling stat(2) on it
# to see if the number of blocks used changes.
stat_loop $MNT/foobar $blocks_used &
loop_pid=$!
xfs_io -c "fsync" $MNT/foobar
kill $loop_pid &> /dev/null
wait $loop_pid
done
umount $DEV
$ ./reproducer-2.sh
ERROR: unexpected used blocks (got: 265472 expected: 265344)
ERROR: unexpected used blocks (got: 284032 expected: 283904)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
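The double accounting in Case 2 can be worked through numerically. The sketch below assumes that the value reported to stat(2) is the sum of the inode's byte count and its 'new_delalloc_bytes' counter, expressed in 512-byte blocks; that formula and the helper name are assumptions made for illustration, chosen because they reproduce the reproducer output quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* stat(2) reports st_blocks in units of 512 bytes. */
#define SKETCH_STAT_BLOCK_SIZE	512

/*
 * Assumed model of the reported value: bytes already accounted to the
 * inode plus bytes of new delalloc that were not flushed yet.
 */
static uint64_t sketch_stat_blocks(uint64_t inode_bytes,
				   uint64_t new_delalloc_bytes)
{
	assert(inode_bytes % SKETCH_STAT_BLOCK_SIZE == 0);
	assert(new_delalloc_bytes % SKETCH_STAT_BLOCK_SIZE == 0);
	return (inode_bytes + new_delalloc_bytes) / SKETCH_STAT_BLOCK_SIZE;
}
```

With 64K writes, the first error line corresponds to the 2073rd write. Before writeback, `sketch_stat_blocks(2072 * 65536, 65536)` is 265344. In the race window the new extent is counted in both terms, so `sketch_stat_blocks(2073 * 65536, 65536)` is 265472, which is 128 blocks (one 64K extent) too many. Once the counter is decremented, `sketch_stat_blocks(2073 * 65536, 0)` is 265344 again.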
-> Case 3
Another case where such problems happen is during other operations that
replace extents in a file range with other extents. Those operations are
extent cloning, deduplication and fallocate's zero range operation.
The cause of the problem is similar to the first case. When we drop the
extents from a range, we decrement the inode's number of bytes, and later
on, after inserting the new extents we increment it. Since this is not
done atomically, a concurrent stat(2) call can see and return a number of
used blocks that is smaller than it should be and does not match the number
of used blocks before or after the clone/deduplication/zero operation.
Like for the first case, when doing a clone, deduplication or zero range
operation against an entire file, we end up having a time window where we
can report 0 used blocks to a stat(2) call.
Example reproducer:
$ cat reproducer-3.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f -m reflink=1 $DEV > /dev/null
mount $DEV $MNT
extent_size=$((64 * 1024))
num_extents=16384
file_size=$(($extent_size * $num_extents))
# File foo has many small extents.
xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
> /dev/null
# File bar has far fewer extents and has exactly the same data as foo.
xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
expected=$(stat -c %b $MNT/foo)
# Now deduplicate bar into foo. While the deduplication is in progress,
# the number of used blocks/file size reported by stat should not change
xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null &
dedupe_pid=$!
while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
used=$(stat -c %b $MNT/foo)
if [ $used -ne $expected ]; then
echo "Unexpected blocks used: $used (expected: $expected)"
fi
done
umount $DEV
$ ./reproducer-3.sh
Unexpected blocks used: 2076800 (expected: 2097152)
Unexpected blocks used: 2097024 (expected: 2097152)
Unexpected blocks used: 2079872 (expected: 2097152)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
So fix this by:
1) Making btrfs_drop_extents() not decrement the VFS inode's number of
bytes, and instead return the number of bytes;
2) Making any code that drops extents and adds new extents update the
inode's number of bytes atomically, while holding the btrfs inode's
spinlock, which is also used by the stat(2) callback to get the inode's
number of bytes;
3) For ranges in the inode's iotree that are marked as 'delalloc new',
corresponding to previously unallocated ranges, increment the inode's
number of bytes when clearing the 'delalloc new' bit from the range,
in the same critical section that decrements the inode's
'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
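Points 1) and 2) can be sketched in userspace C as follows. This is a model only: a C11 atomic_flag stands in for the btrfs inode's spinlock, and every name is hypothetical apart from the (add_bytes, del_bytes) parameter pair, which mirrors the btrfs_update_inode_bytes() declaration in this header.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct sketch_inode {
	atomic_flag lock;	/* stands in for the inode's spinlock */
	uint64_t bytes;		/* bytes used, as reported by stat(2) */
};

static void sketch_inode_init(struct sketch_inode *inode, uint64_t bytes)
{
	atomic_flag_clear(&inode->lock);
	inode->bytes = bytes;
}

static void sketch_lock(struct sketch_inode *inode)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&inode->lock,
						 memory_order_acquire))
		;
}

static void sketch_unlock(struct sketch_inode *inode)
{
	atomic_flag_clear_explicit(&inode->lock, memory_order_release);
}

/*
 * Apply the bytes removed by dropping extents and the bytes added by
 * the new extents in one critical section, so a reader never observes
 * the intermediate value.
 */
static void sketch_update_inode_bytes(struct sketch_inode *inode,
				      uint64_t add_bytes, uint64_t del_bytes)
{
	if (add_bytes == del_bytes)
		return;
	sketch_lock(inode);
	assert(inode->bytes + add_bytes >= del_bytes);
	inode->bytes = inode->bytes + add_bytes - del_bytes;
	sketch_unlock(inode);
}

/* Reader side: takes the same lock, like the stat(2) callback. */
static uint64_t sketch_stat_blocks(struct sketch_inode *inode)
{
	uint64_t bytes;

	sketch_lock(inode);
	bytes = inode->bytes;
	sketch_unlock(inode);
	return bytes / 512;
}
```

For the full-file overwrite of Case 1, calling `sketch_update_inode_bytes(&inode, 65536, 65536)` on a 64K file leaves the reader at 128 blocks throughout, instead of briefly dropping to 0 between a separate decrement and increment.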
An alternative would be to have btrfs_getattr() wait for any IO (ordered
extents in progress) and lock the whole range (0 to (u64)-1) while it
computes the number of blocks used. But that would mean blocking
stat(2), which is a heavily used syscall and expected to be fast, waiting
for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-04 11:07:34 +00:00
void btrfs_update_inode_bytes ( struct btrfs_inode * inode ,
const u64 add_bytes ,
const u64 del_bytes ) ;
2008-06-11 21:53:53 -04:00
/* ioctl.c */
long btrfs_ioctl ( struct file * file , unsigned int cmd , unsigned long arg ) ;
2015-10-29 08:22:21 +00:00
long btrfs_compat_ioctl ( struct file * file , unsigned int cmd , unsigned long arg ) ;
2016-02-17 15:26:27 +01:00
int btrfs_ioctl_get_supported_features ( void __user * arg ) ;
2018-03-26 18:40:21 +02:00
void btrfs_sync_inode_flags_to_i_flags ( struct inode * inode ) ;
2019-10-01 19:57:39 +02:00
int __pure btrfs_is_empty_uuid ( u8 * uuid ) ;
2011-05-24 15:35:30 -04:00
int btrfs_defrag_file ( struct inode * inode , struct file * file ,
struct btrfs_ioctl_defrag_range_args * range ,
u64 newer_than , unsigned long max_pages ) ;
2018-03-21 02:05:27 +01:00
void btrfs_get_block_group_info ( struct list_head * groups_list ,
struct btrfs_ioctl_space_info * space ) ;
void btrfs_update_ioctl_balance_args ( struct btrfs_fs_info * fs_info ,
2013-08-14 18:12:25 +02:00
struct btrfs_ioctl_balance_args * bargs ) ;
2020-08-25 10:02:32 -05:00
bool btrfs_exclop_start ( struct btrfs_fs_info * fs_info ,
enum btrfs_exclusive_operation type ) ;
void btrfs_exclop_finish ( struct btrfs_fs_info * fs_info ) ;
2013-08-14 18:12:25 +02:00
2007-06-12 06:35:45 -04:00
/* file.c */
2017-11-02 17:21:50 -06:00
int __init btrfs_auto_defrag_init ( void ) ;
2018-02-19 17:24:18 +01:00
void __cold btrfs_auto_defrag_exit ( void ) ;
2011-05-24 15:35:30 -04:00
int btrfs_add_inode_defrag ( struct btrfs_trans_handle * trans ,
2017-02-20 13:50:43 +02:00
struct btrfs_inode * inode ) ;
2011-05-24 15:35:30 -04:00
int btrfs_run_defrag_inodes ( struct btrfs_fs_info * fs_info ) ;
2012-11-26 09:26:20 +00:00
void btrfs_cleanup_defrag_inodes ( struct btrfs_fs_info * fs_info ) ;
2011-07-16 20:44:56 -04:00
int btrfs_sync_file ( struct file * file , loff_t start , loff_t end , int datasync ) ;
2017-02-20 13:50:45 +02:00
void btrfs_drop_extent_cache ( struct btrfs_inode * inode , u64 start , u64 end ,
2012-08-30 20:06:49 -04:00
int skip_pinned ) ;
2009-10-01 15:43:56 -07:00
extern const struct file_operations btrfs_file_operations ;
Btrfs: turbo charge fsync
At least for the vm workload. Currently on fsync we will
1) Truncate all items in the log tree for the given inode if they exist
and
2) Copy all items for a given inode into the log
The problem with this is that for things like VMs you can have lots of
extents from the fragmented writing behavior, and worse yet you may have
only modified a few extents, not the entire thing. This patch fixes this
problem by tracking which transid modified our extent, and then when we do
the tree logging we find all of the extents we've modified in our current
transaction, sort them and commit them. We also only truncate up to the
xattrs of the inode and copy that stuff in normally, and then just drop any
extents in the range we have that exist in the log already. Here are some
numbers of a 50 meg fio job that does random writes and fsync()s after every
write
                Original    Patched
SATA drive      82KB/s      140KB/s
Fusion drive    431KB/s     2532KB/s
So around 2-6 times faster depending on your hardware. There are a few
corner cases, for example if you truncate at all we have to do it the old
way since there is no way to be sure what is in the log is ok. This
probably could be done smarter, but if you write-fsync-truncate-write-fsync
you deserve what you get. All this work is in RAM of course so if your
inode gets evicted from cache and you read it in and fsync it we'll do it
the slow way if we are still in the same transaction that we last modified
the inode in.
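The idea of logging only the extents touched in the current transaction can be sketched as follows. This is an illustrative user-space model, not the tree-log code: `struct demo_extent`, its `generation` field, and `demo_collect_modified()` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical in-memory extent record for this sketch. */
struct demo_extent {
	uint64_t start;		/* file offset of the extent */
	uint64_t len;		/* length in bytes */
	uint64_t generation;	/* transid that last modified this extent */
};

static int demo_cmp_start(const void *a, const void *b)
{
	const struct demo_extent *ea = a;
	const struct demo_extent *eb = b;

	if (ea->start < eb->start)
		return -1;
	if (ea->start > eb->start)
		return 1;
	return 0;
}

/*
 * Copy the extents modified in cur_transid into out (which must have room
 * for n entries), sort them by file offset, and return how many were found.
 */
static size_t demo_collect_modified(const struct demo_extent *all, size_t n,
				    uint64_t cur_transid,
				    struct demo_extent *out)
{
	size_t i, found = 0;

	assert(all != NULL && out != NULL);

	for (i = 0; i < n; i++) {
		if (all[i].generation == cur_transid)
			out[found++] = all[i];
	}
	qsort(out, found, sizeof(*out), demo_cmp_start);
	return found;
}
```

Only the returned extents would then be written to the log; extents last modified in older transactions are skipped, which is where the savings for large, fragmented files come from.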
The biggest cool part of this is that it requires no changes to the recovery
code, so if you fsync with this patch and crash and load an old kernel, it
will run the recovery and be a-ok. I have tested this pretty thoroughly
with an fsync tester and everything comes back fine, as well as xfstests.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-17 13:14:17 -04:00
int btrfs_drop_extents ( struct btrfs_trans_handle * trans ,
2020-11-04 11:07:32 +00:00
struct btrfs_root * root , struct btrfs_inode * inode ,
struct btrfs_drop_extents_args * args ) ;
2021-02-17 15:12:47 +02:00
int btrfs_replace_file_extents ( struct btrfs_inode * inode ,
struct btrfs_path * path , const u64 start ,
const u64 end ,
2020-09-08 11:27:22 +01:00
struct btrfs_replace_extent_info * extent_info ,
Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents
When cloning extents (or deduplicating) we create a transaction with a
space reservation that considers we will drop or update a single file
extent item of the destination inode (that we modify a single leaf). That
is fine for the vast majority of scenarios, however it might happen that
we need to drop many file extent items, and adjust at most two file extent
items, in the destination root, which can span multiple leaves. This will
lead to either the call to btrfs_drop_extents() failing with ENOSPC or
the subsequent calls to btrfs_insert_empty_item() or btrfs_update_inode()
(called through clone_finish_inode_update()) failing with ENOSPC. Such
failure results in a transaction abort, leaving the filesystem in a
read-only mode.
In order to fix this we need to follow the same approach as the hole
punching code, where we create a local reservation with 1 unit and keep
ending and starting transactions, after balancing the btree inode,
when __btrfs_drop_extents() returns ENOSPC. So fix this by making the
extent cloning code call the recently added btrfs_punch_hole_range()
helper, which is what does the mentioned work for hole punching, and
make sure whenever we drop extent items in a transaction, we also add a
replacing file extent item, to avoid corruption (a hole) if after ending
a transaction and before starting a new one, the old transaction gets
committed and a power failure happens before we finish cloning.
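The retry pattern described above (a small reservation, and a fresh transaction each time the drop runs out of space) can be modelled roughly like this. It is a simplified user-space sketch under stated assumptions: `struct demo_trans`, `demo_drop_extents()` and `demo_replace_range()` are hypothetical, a "reservation" is reduced to a count of items that may be modified, and the step that inserts the replacing file extent item appears only as a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical transaction: may modify at most items_left items. */
struct demo_trans {
	unsigned int items_left;
};

/* Drop extents of extent_len bytes from *cur up to end; -ENOSPC when out of budget. */
static int demo_drop_extents(struct demo_trans *trans, uint64_t *cur,
			     uint64_t end, uint64_t extent_len)
{
	while (*cur < end) {
		if (trans->items_left == 0)
			return -ENOSPC;
		trans->items_left--;
		*cur += extent_len;
		if (*cur > end)
			*cur = end;
	}
	return 0;
}

/* Keep ending and starting transactions until the whole range is processed. */
static int demo_replace_range(uint64_t start, uint64_t end, uint64_t extent_len,
			      unsigned int items_per_trans,
			      unsigned int *nr_trans)
{
	struct demo_trans trans;
	uint64_t cur = start;
	int ret;

	assert(extent_len > 0);
	*nr_trans = 0;
	if (items_per_trans == 0)
		return -ENOSPC;	/* no progress would ever be possible */

	do {
		/* "Start" a transaction with a small, fixed reservation. */
		trans.items_left = items_per_trans;
		(*nr_trans)++;
		ret = demo_drop_extents(&trans, &cur, end, extent_len);
		/*
		 * Real code would insert the replacing file extent item for
		 * the part dropped so far before ending this transaction, so
		 * that a commit followed by a crash cannot leave a hole.
		 */
	} while (ret == -ENOSPC);

	return ret;
}
```

With ten 4096-byte extents and a budget of four items per transaction, the loop in this model needs three transactions instead of failing with ENOSPC on the first one.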
A test case for fstests follows soon.
Reported-by: David Goodwin <david@codepoets.co.uk>
Link: https://lore.kernel.org/linux-btrfs/a4a4cf31-9cf4-e52c-1f86-c62d336c9cd1@codepoets.co.uk/
Reported-by: Sam Tygier <sam@tygier.co.uk>
Link: https://lore.kernel.org/linux-btrfs/82aace9f-a1e3-1f0b-055f-3ea75f7a41a0@tygier.co.uk/
Fixes: b6f3409b2197e8f ("Btrfs: reserve sufficient space for ioctl clone")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-05 11:09:50 +01:00
struct btrfs_trans_handle * * trans_out ) ;
2008-10-30 14:25:28 -04:00
int btrfs_mark_extent_written ( struct btrfs_trans_handle * trans ,
2017-02-20 13:50:48 +02:00
struct btrfs_inode * inode , u64 start , u64 end ) ;
2008-06-10 10:07:39 -04:00
int btrfs_release_file ( struct inode * inode , struct file * file ) ;
2020-06-03 08:55:36 +03:00
int btrfs_dirty_pages ( struct btrfs_inode * inode , struct page * * pages ,
2016-06-22 18:54:24 -04:00
size_t num_pages , loff_t pos , size_t write_bytes ,
2020-10-14 09:55:45 -05:00
struct extent_state * * cached , bool noreserve ) ;
2014-10-10 09:43:11 +01:00
int btrfs_fdatawrite_range ( struct inode * inode , loff_t start , loff_t end ) ;
2020-06-24 07:23:52 +08:00
int btrfs_check_nocow_lock ( struct btrfs_inode * inode , loff_t pos ,
size_t * write_bytes ) ;
void btrfs_check_nocow_unlock ( struct btrfs_inode * inode ) ;
2008-06-10 10:07:39 -04:00
2007-08-07 16:15:09 -04:00
/* tree-defrag.c */
int btrfs_defrag_leaves ( struct btrfs_trans_handle * trans ,
2013-01-31 18:21:12 +00:00
struct btrfs_root * root ) ;
2007-08-29 15:47:34 -04:00
2007-12-21 16:27:24 -05:00
/* super.c */
2016-06-22 18:54:24 -04:00
int btrfs_parse_options ( struct btrfs_fs_info * info , char * options ,
2016-01-19 10:23:03 +08:00
unsigned long new_flags ) ;
2008-06-10 10:07:39 -04:00
int btrfs_sync_fs ( struct super_block * sb , int wait ) ;
2020-02-21 14:56:12 +01:00
char * btrfs_get_subvol_name_from_objectid ( struct btrfs_fs_info * fs_info ,
u64 subvol_objectid ) ;
2012-07-30 14:40:13 -07:00
2018-02-19 17:24:18 +01:00
static inline __printf ( 2 , 3 ) __cold
2016-09-23 18:05:21 +02:00
void btrfs_no_printk ( const struct btrfs_fs_info * fs_info , const char * fmt , . . . )
{
}
2012-07-30 14:40:13 -07:00
# ifdef CONFIG_PRINTK
__printf ( 2 , 3 )
2018-02-19 17:24:18 +01:00
__cold
2013-03-19 22:41:23 +00:00
void btrfs_printk ( const struct btrfs_fs_info * fs_info , const char * fmt , . . . ) ;
2012-07-30 14:40:13 -07:00
# else
2016-09-23 18:05:21 +02:00
# define btrfs_printk(fs_info, fmt, args...) \
btrfs_no_printk ( fs_info , fmt , # # args )
2012-07-30 14:40:13 -07:00
# endif
2013-03-19 22:41:23 +00:00
# define btrfs_emerg(fs_info, fmt, args...) \
btrfs_printk ( fs_info , KERN_EMERG fmt , # # args )
# define btrfs_alert(fs_info, fmt, args...) \
btrfs_printk ( fs_info , KERN_ALERT fmt , # # args )
# define btrfs_crit(fs_info, fmt, args...) \
btrfs_printk ( fs_info , KERN_CRIT fmt , # # args )
# define btrfs_err(fs_info, fmt, args...) \
btrfs_printk ( fs_info , KERN_ERR fmt , # # args )
# define btrfs_warn(fs_info, fmt, args...) \
btrfs_printk ( fs_info , KERN_WARNING fmt , # # args )
# define btrfs_notice(fs_info, fmt, args...) \
btrfs_printk ( fs_info , KERN_NOTICE fmt , # # args )
# define btrfs_info(fs_info, fmt, args...) \
btrfs_printk ( fs_info , KERN_INFO fmt , # # args )
2013-11-12 19:22:53 -05:00
2015-10-08 08:48:52 +02:00
/*
* Wrappers that use printk_in_rcu
*/
# define btrfs_emerg_in_rcu(fs_info, fmt, args...) \
btrfs_printk_in_rcu ( fs_info , KERN_EMERG fmt , # # args )
# define btrfs_alert_in_rcu(fs_info, fmt, args...) \
btrfs_printk_in_rcu ( fs_info , KERN_ALERT fmt , # # args )
# define btrfs_crit_in_rcu(fs_info, fmt, args...) \
btrfs_printk_in_rcu ( fs_info , KERN_CRIT fmt , # # args )
# define btrfs_err_in_rcu(fs_info, fmt, args...) \
btrfs_printk_in_rcu ( fs_info , KERN_ERR fmt , # # args )
# define btrfs_warn_in_rcu(fs_info, fmt, args...) \
btrfs_printk_in_rcu ( fs_info , KERN_WARNING fmt , # # args )
# define btrfs_notice_in_rcu(fs_info, fmt, args...) \
btrfs_printk_in_rcu ( fs_info , KERN_NOTICE fmt , # # args )
# define btrfs_info_in_rcu(fs_info, fmt, args...) \
btrfs_printk_in_rcu ( fs_info , KERN_INFO fmt , # # args )
2015-10-08 10:27:02 +02:00
/*
* Wrappers that use a ratelimited printk_in_rcu
*/
# define btrfs_emerg_rl_in_rcu(fs_info, fmt, args...) \
btrfs_printk_rl_in_rcu ( fs_info , KERN_EMERG fmt , # # args )
# define btrfs_alert_rl_in_rcu(fs_info, fmt, args...) \
btrfs_printk_rl_in_rcu ( fs_info , KERN_ALERT fmt , # # args )
# define btrfs_crit_rl_in_rcu(fs_info, fmt, args...) \
btrfs_printk_rl_in_rcu ( fs_info , KERN_CRIT fmt , # # args )
# define btrfs_err_rl_in_rcu(fs_info, fmt, args...) \
btrfs_printk_rl_in_rcu ( fs_info , KERN_ERR fmt , # # args )
# define btrfs_warn_rl_in_rcu(fs_info, fmt, args...) \
btrfs_printk_rl_in_rcu ( fs_info , KERN_WARNING fmt , # # args )
# define btrfs_notice_rl_in_rcu(fs_info, fmt, args...) \
btrfs_printk_rl_in_rcu ( fs_info , KERN_NOTICE fmt , # # args )
# define btrfs_info_rl_in_rcu(fs_info, fmt, args...) \
btrfs_printk_rl_in_rcu ( fs_info , KERN_INFO fmt , # # args )
2015-10-08 10:51:11 +02:00
/*
* Wrappers that use a ratelimited printk
*/
# define btrfs_emerg_rl(fs_info, fmt, args...) \
btrfs_printk_ratelimited ( fs_info , KERN_EMERG fmt , # # args )
# define btrfs_alert_rl(fs_info, fmt, args...) \
btrfs_printk_ratelimited ( fs_info , KERN_ALERT fmt , # # args )
# define btrfs_crit_rl(fs_info, fmt, args...) \
btrfs_printk_ratelimited ( fs_info , KERN_CRIT fmt , # # args )
# define btrfs_err_rl(fs_info, fmt, args...) \
btrfs_printk_ratelimited ( fs_info , KERN_ERR fmt , # # args )
# define btrfs_warn_rl(fs_info, fmt, args...) \
btrfs_printk_ratelimited ( fs_info , KERN_WARNING fmt , # # args )
# define btrfs_notice_rl(fs_info, fmt, args...) \
btrfs_printk_ratelimited ( fs_info , KERN_NOTICE fmt , # # args )
# define btrfs_info_rl(fs_info, fmt, args...) \
btrfs_printk_ratelimited ( fs_info , KERN_INFO fmt , # # args )
2016-08-31 23:55:33 -04:00
# if defined(CONFIG_DYNAMIC_DEBUG)
# define btrfs_debug(fs_info, fmt, args...) \
2019-03-07 16:28:00 -08:00
_dynamic_func_call_no_desc ( fmt , btrfs_printk , \
fs_info , KERN_DEBUG fmt , # # args )
# define btrfs_debug_in_rcu(fs_info, fmt, args...) \
_dynamic_func_call_no_desc ( fmt , btrfs_printk_in_rcu , \
fs_info , KERN_DEBUG fmt , # # args )
2016-08-31 23:55:33 -04:00
# define btrfs_debug_rl_in_rcu(fs_info, fmt, args...) \
2019-03-07 16:28:00 -08:00
_dynamic_func_call_no_desc ( fmt , btrfs_printk_rl_in_rcu , \
fs_info , KERN_DEBUG fmt , # # args )
# define btrfs_debug_rl(fs_info, fmt, args...) \
_dynamic_func_call_no_desc ( fmt , btrfs_printk_ratelimited , \
fs_info , KERN_DEBUG fmt , # # args )
2016-08-31 23:55:33 -04:00
# elif defined(DEBUG)
2013-03-19 22:41:23 +00:00
# define btrfs_debug(fs_info, fmt, args...) \
btrfs_printk ( fs_info , KERN_DEBUG fmt , # # args )
2015-10-08 08:48:52 +02:00
# define btrfs_debug_in_rcu(fs_info, fmt, args...) \
btrfs_printk_in_rcu ( fs_info , KERN_DEBUG fmt , # # args )
2015-10-08 10:27:02 +02:00
# define btrfs_debug_rl_in_rcu(fs_info, fmt, args...) \
btrfs_printk_rl_in_rcu ( fs_info , KERN_DEBUG fmt , # # args )
2015-10-08 10:51:11 +02:00
# define btrfs_debug_rl(fs_info, fmt, args...) \
btrfs_printk_ratelimited ( fs_info , KERN_DEBUG fmt , # # args )
2013-11-12 19:22:53 -05:00
# else
# define btrfs_debug(fs_info, fmt, args...) \
2016-09-21 12:17:37 -04:00
btrfs_no_printk ( fs_info , KERN_DEBUG fmt , # # args )
2015-10-08 08:48:52 +02:00
# define btrfs_debug_in_rcu(fs_info, fmt, args...) \
2018-08-24 11:35:28 +09:00
btrfs_no_printk_in_rcu ( fs_info , KERN_DEBUG fmt , # # args )
2015-10-08 10:27:02 +02:00
# define btrfs_debug_rl_in_rcu(fs_info, fmt, args...) \
2018-08-24 11:35:28 +09:00
btrfs_no_printk_in_rcu ( fs_info , KERN_DEBUG fmt , # # args )
2015-10-08 10:51:11 +02:00
# define btrfs_debug_rl(fs_info, fmt, args...) \
2016-09-21 12:17:37 -04:00
btrfs_no_printk ( fs_info , KERN_DEBUG fmt , # # args )
2013-11-12 19:22:53 -05:00
# endif
2013-03-19 22:41:23 +00:00
2015-10-08 08:48:52 +02:00
# define btrfs_printk_in_rcu(fs_info, fmt, args...) \
do { \
rcu_read_lock ( ) ; \
btrfs_printk ( fs_info , fmt , # # args ) ; \
2018-08-24 11:35:28 +09:00
rcu_read_unlock ( ) ; \
} while ( 0 )
# define btrfs_no_printk_in_rcu(fs_info, fmt, args...) \
do { \
rcu_read_lock ( ) ; \
btrfs_no_printk ( fs_info , fmt , # # args ) ; \
2015-10-08 08:48:52 +02:00
rcu_read_unlock ( ) ; \
} while ( 0 )
2015-10-08 10:27:02 +02:00
# define btrfs_printk_ratelimited(fs_info, fmt, args...) \
do { \
static DEFINE_RATELIMIT_STATE ( _rs , \
DEFAULT_RATELIMIT_INTERVAL , \
DEFAULT_RATELIMIT_BURST ) ; \
if ( __ratelimit ( & _rs ) ) \
btrfs_printk ( fs_info , fmt , # # args ) ; \
} while ( 0 )
# define btrfs_printk_rl_in_rcu(fs_info, fmt, args...) \
do { \
rcu_read_lock ( ) ; \
btrfs_printk_ratelimited ( fs_info , fmt , # # args ) ; \
rcu_read_unlock ( ) ; \
} while ( 0 )
2019-12-16 20:00:48 +01:00
# ifdef CONFIG_BTRFS_ASSERT
__cold __noreturn
static inline void assertfail ( const char * expr , const char * file , int line )
2013-08-26 16:53:15 -04:00
{
2019-12-16 20:00:48 +01:00
pr_err ( " assertion failed: %s, in %s:%d \n " , expr , file , line ) ;
BUG ( ) ;
2013-08-26 16:53:15 -04:00
}
2019-12-16 20:00:48 +01:00
# define ASSERT(expr) \
( likely ( expr ) ? ( void ) 0 : assertfail ( # expr , __FILE__ , __LINE__ ) )
# else
static inline void assertfail ( const char * expr , const char * file , int line ) { }
# define ASSERT(expr) (void)(expr)
# endif
2013-08-26 16:53:15 -04:00
2020-12-02 14:48:04 +08:00
/*
* Get the correct offset inside the page of extent buffer .
*
* @ eb : target extent buffer
* @ offset : offset inside the extent buffer
*
* Will handle both sectorsize = = PAGE_SIZE and sectorsize < PAGE_SIZE cases .
*/
static inline size_t get_eb_offset_in_page ( const struct extent_buffer * eb ,
unsigned long offset )
{
/*
* For sectorsize = = PAGE_SIZE case , eb - > start will always be aligned
* to PAGE_SIZE , thus adding it won ' t cause any difference .
*
* For sectorsize < PAGE_SIZE , we must only read the data that belongs
* to the eb , thus we have to take the eb - > start into consideration .
*/
return offset_in_page ( offset + eb - > start ) ;
}
static inline unsigned long get_eb_page_index ( unsigned long offset )
{
/*
* For sectorsize = = PAGE_SIZE case , plain > > PAGE_SHIFT is enough .
*
* For sectorsize < PAGE_SIZE case , we only support 64 K PAGE_SIZE ,
* and have ensured that all tree blocks are contained in one page ,
* thus we always get index = = 0.
*/
return offset > > PAGE_SHIFT ;
}
2018-11-19 10:38:16 +01:00
/*
* Use this for functions that are conditionally exported for sanity tests but
* otherwise static
*/
# ifndef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
# define EXPORT_FOR_TESTS static
# else
# define EXPORT_FOR_TESTS
# endif
2018-06-26 16:57:36 +03:00
__cold
static inline void btrfs_print_v0_err ( struct btrfs_fs_info * fs_info )
{
btrfs_err ( fs_info ,
" Unsupported V0 extent filesystem detected. Aborting. Please re-create your filesystem with a newer kernel " ) ;
}
2012-07-30 14:40:13 -07:00
__printf ( 5 , 6 )
2015-04-24 19:11:57 +02:00
__cold
2016-03-16 16:43:06 +08:00
void __btrfs_handle_fs_error ( struct btrfs_fs_info * fs_info , const char * function ,
2012-03-01 14:57:30 +01:00
unsigned int line , int errno , const char * fmt , . . . ) ;
2011-01-06 19:30:25 +08:00
2019-10-01 19:57:37 +02:00
const char * __attribute_const__ btrfs_decode_error ( int errno ) ;
2012-07-30 14:40:13 -07:00
2015-04-24 19:11:57 +02:00
__cold
2012-03-01 17:24:58 +01:00
void __btrfs_abort_transaction ( struct btrfs_trans_handle * trans ,
2016-06-10 18:19:25 -04:00
const char * function ,
2012-03-01 17:24:58 +01:00
unsigned int line , int errno ) ;
2016-03-16 16:43:08 +08:00
/*
* Call btrfs_abort_transaction as early as possible when an error condition is
* detected , that way the exact line number is reported .
*/
2016-06-10 18:19:25 -04:00
# define btrfs_abort_transaction(trans, errno) \
2016-03-16 16:43:08 +08:00
do { \
/* Report first abort since mount */ \
if ( ! test_and_set_bit ( BTRFS_FS_STATE_TRANS_ABORTED , \
2016-06-10 18:19:25 -04:00
& ( ( trans ) - > fs_info - > fs_state ) ) ) { \
2020-07-21 11:24:27 -04:00
if ( ( errno ) ! = - EIO & & ( errno ) ! = - EROFS ) { \
2016-12-09 05:56:33 -08:00
WARN ( 1 , KERN_DEBUG \
" BTRFS: Transaction aborted (error %d) \n " , \
( errno ) ) ; \
} else { \
2017-02-15 16:28:34 -05:00
btrfs_debug ( ( trans ) - > fs_info , \
" Transaction aborted (error %d) " , \
2016-12-09 05:56:33 -08:00
( errno ) ) ; \
} \
2016-03-16 16:43:08 +08:00
} \
2016-06-10 18:19:25 -04:00
__btrfs_abort_transaction ( ( trans ) , __func__ , \
2016-03-16 16:43:08 +08:00
__LINE__ , ( errno ) ) ; \
} while ( 0 )
# define btrfs_handle_fs_error(fs_info, errno, fmt, args...) \
do { \
__btrfs_handle_fs_error ( ( fs_info ) , __func__ , __LINE__ , \
( errno ) , fmt , # # args ) ; \
} while ( 0 )
__printf ( 5 , 6 )
__cold
void __btrfs_panic ( struct btrfs_fs_info * fs_info , const char * function ,
unsigned int line , int errno , const char * fmt , . . . ) ;
/*
* If BTRFS_MOUNT_PANIC_ON_FATAL_ERROR is in mount_opt , __btrfs_panic
* will panic ( ) . Otherwise we BUG ( ) here .
*/
# define btrfs_panic(fs_info, errno, fmt, args...) \
do { \
__btrfs_panic ( fs_info , __func__ , __LINE__ , errno , fmt , # # args ) ; \
BUG ( ) ; \
} while ( 0 )
/* compatibility and incompatibility defines */
2012-07-24 11:58:43 -06:00
# define btrfs_set_fs_incompat(__fs_info, opt) \
2019-06-13 17:55:03 +02:00
__btrfs_set_fs_incompat ( ( __fs_info ) , BTRFS_FEATURE_INCOMPAT_ # # opt , \
# opt)
2012-07-24 11:58:43 -06:00
static inline void __btrfs_set_fs_incompat ( struct btrfs_fs_info * fs_info ,
2019-06-13 17:55:03 +02:00
u64 flag , const char * name )
2012-07-24 11:58:43 -06:00
{
struct btrfs_super_block * disk_super ;
u64 features ;
disk_super = fs_info - > super_copy ;
features = btrfs_super_incompat_flags ( disk_super ) ;
if ( ! ( features & flag ) ) {
2013-04-11 10:30:16 +00:00
spin_lock ( & fs_info - > super_lock ) ;
features = btrfs_super_incompat_flags ( disk_super ) ;
if ( ! ( features & flag ) ) {
features | = flag ;
btrfs_set_super_incompat_flags ( disk_super , features ) ;
2019-06-13 17:55:03 +02:00
btrfs_info ( fs_info ,
" setting incompat feature flag for %s (0x%llx) " ,
name , flag ) ;
2013-04-11 10:30:16 +00:00
}
spin_unlock ( & fs_info - > super_lock ) ;
2012-07-24 11:58:43 -06:00
}
}
2015-09-29 20:50:32 -07:00
# define btrfs_clear_fs_incompat(__fs_info, opt) \
2019-06-13 17:55:03 +02:00
__btrfs_clear_fs_incompat ( ( __fs_info ) , BTRFS_FEATURE_INCOMPAT_ # # opt , \
# opt)
2015-09-29 20:50:32 -07:00
static inline void __btrfs_clear_fs_incompat ( struct btrfs_fs_info * fs_info ,
2019-06-13 17:55:03 +02:00
u64 flag , const char * name )
2015-09-29 20:50:32 -07:00
{
struct btrfs_super_block * disk_super ;
u64 features ;
disk_super = fs_info - > super_copy ;
features = btrfs_super_incompat_flags ( disk_super ) ;
if ( features & flag ) {
spin_lock ( & fs_info - > super_lock ) ;
features = btrfs_super_incompat_flags ( disk_super ) ;
if ( features & flag ) {
features & = ~ flag ;
btrfs_set_super_incompat_flags ( disk_super , features ) ;
2019-06-13 17:55:03 +02:00
btrfs_info ( fs_info ,
" clearing incompat feature flag for %s (0x%llx) " ,
name , flag ) ;
2015-09-29 20:50:32 -07:00
}
spin_unlock ( & fs_info - > super_lock ) ;
}
}
2013-03-07 14:22:04 -05:00
# define btrfs_fs_incompat(fs_info, opt) \
__btrfs_fs_incompat ( ( fs_info ) , BTRFS_FEATURE_INCOMPAT_ # # opt )
2015-10-18 21:35:41 +00:00
static inline bool __btrfs_fs_incompat ( struct btrfs_fs_info * fs_info , u64 flag )
2013-03-07 14:22:04 -05:00
{
struct btrfs_super_block * disk_super ;
disk_super = fs_info - > super_copy ;
return ! ! ( btrfs_super_incompat_flags ( disk_super ) & flag ) ;
}
2015-09-29 20:50:32 -07:00
# define btrfs_set_fs_compat_ro(__fs_info, opt) \
2019-06-13 17:55:03 +02:00
__btrfs_set_fs_compat_ro ( ( __fs_info ) , BTRFS_FEATURE_COMPAT_RO_ # # opt , \
# opt)
2015-09-29 20:50:32 -07:00
static inline void __btrfs_set_fs_compat_ro ( struct btrfs_fs_info * fs_info ,
2019-06-13 17:55:03 +02:00
u64 flag , const char * name )
2015-09-29 20:50:32 -07:00
{
struct btrfs_super_block * disk_super ;
u64 features ;
disk_super = fs_info - > super_copy ;
features = btrfs_super_compat_ro_flags ( disk_super ) ;
if ( ! ( features & flag ) ) {
spin_lock ( & fs_info - > super_lock ) ;
features = btrfs_super_compat_ro_flags ( disk_super ) ;
if ( ! ( features & flag ) ) {
features | = flag ;
btrfs_set_super_compat_ro_flags ( disk_super , features ) ;
2019-06-13 17:55:03 +02:00
btrfs_info ( fs_info ,
" setting compat-ro feature flag for %s (0x%llx) " ,
name , flag ) ;
2015-09-29 20:50:32 -07:00
}
spin_unlock ( & fs_info - > super_lock ) ;
}
}
# define btrfs_clear_fs_compat_ro(__fs_info, opt) \
2019-06-13 17:55:03 +02:00
__btrfs_clear_fs_compat_ro ( ( __fs_info ) , BTRFS_FEATURE_COMPAT_RO_ # # opt , \
# opt)
2015-09-29 20:50:32 -07:00
static inline void __btrfs_clear_fs_compat_ro ( struct btrfs_fs_info * fs_info ,
2019-06-13 17:55:03 +02:00
u64 flag , const char * name )
2015-09-29 20:50:32 -07:00
{
struct btrfs_super_block * disk_super ;
u64 features ;
disk_super = fs_info - > super_copy ;
features = btrfs_super_compat_ro_flags ( disk_super ) ;
if ( features & flag ) {
spin_lock ( & fs_info - > super_lock ) ;
features = btrfs_super_compat_ro_flags ( disk_super ) ;
if ( features & flag ) {
features & = ~ flag ;
btrfs_set_super_compat_ro_flags ( disk_super , features ) ;
2019-06-13 17:55:03 +02:00
btrfs_info ( fs_info ,
" clearing compat-ro feature flag for %s (0x%llx) " ,
name , flag ) ;
2015-09-29 20:50:32 -07:00
}
spin_unlock ( & fs_info - > super_lock ) ;
}
}
# define btrfs_fs_compat_ro(fs_info, opt) \
__btrfs_fs_compat_ro ( ( fs_info ) , BTRFS_FEATURE_COMPAT_RO_ # # opt )
static inline int __btrfs_fs_compat_ro ( struct btrfs_fs_info * fs_info , u64 flag )
{
struct btrfs_super_block * disk_super ;
disk_super = fs_info - > super_copy ;
return ! ! ( btrfs_super_compat_ro_flags ( disk_super ) & flag ) ;
}
2008-07-24 12:16:36 -04:00
/* acl.c */
2009-10-13 13:50:18 -04:00
# ifdef CONFIG_BTRFS_FS_POSIX_ACL
2011-07-23 17:37:31 +02:00
struct posix_acl * btrfs_get_acl ( struct inode * inode , int type ) ;
2021-01-21 14:19:43 +01:00
int btrfs_set_acl ( struct user_namespace * mnt_userns , struct inode * inode ,
struct posix_acl * acl , int type ) ;
2009-11-12 09:35:27 +00:00
int btrfs_init_acl ( struct btrfs_trans_handle * trans ,
struct inode * inode , struct inode * dir ) ;
2011-07-14 03:17:39 +00:00
# else
2011-08-02 21:14:05 -10:00
# define btrfs_get_acl NULL
2013-12-20 05:16:43 -08:00
# define btrfs_set_acl NULL
2011-07-14 03:17:39 +00:00
static inline int btrfs_init_acl ( struct btrfs_trans_handle * trans ,
struct inode * inode , struct inode * dir )
{
return 0 ;
}
# endif
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replace it with an rb-tree that tracks the block group cache via offset. Also
add a per space_info list that tracks the block group cache for the particular
space so we can look up related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so every time we did a data allocation we'd search
through every block group that we have looking for free space. While searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degradation and made things semi-normal.
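The search order from point 1) above (hint first, then a size-based fallback) can be sketched as below. This is one reading of the description, not the actual allocator: a plain array sorted by offset stands in for the two rb-trees, and `struct demo_free_space` and `demo_find_free_space()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-space record; the real code keeps these in rb-trees. */
struct demo_free_space {
	uint64_t offset;
	uint64_t bytes;
};

static uint64_t demo_distance(uint64_t a, uint64_t b)
{
	return a > b ? a - b : b - a;
}

/*
 * Return the index of the chosen free area, or -1 if nothing fits.
 * 'areas' must be sorted by offset.
 */
static int demo_find_free_space(const struct demo_free_space *areas, size_t n,
				uint64_t hint, uint64_t bytes)
{
	size_t i;
	int best = -1;

	assert(areas != NULL || n == 0);

	/* Pass 1: first area at or after the hint that is large enough. */
	for (i = 0; i < n; i++) {
		if (areas[i].offset >= hint && areas[i].bytes >= bytes)
			return (int)i;
	}

	/*
	 * Pass 2: fall back to a size search. Prefer the tightest fit so
	 * small chunks are not broken off huge areas; break ties by
	 * distance to the hint.
	 */
	for (i = 0; i < n; i++) {
		if (areas[i].bytes < bytes)
			continue;
		if (best < 0 ||
		    areas[i].bytes < areas[best].bytes ||
		    (areas[i].bytes == areas[best].bytes &&
		     demo_distance(areas[i].offset, hint) <
		     demo_distance(areas[best].offset, hint)))
			best = (int)i;
	}
	return best;
}
```

The linear scans keep the sketch short; the point of the offset-indexed and size-indexed trees in the commit is to make both passes logarithmic.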
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
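The two COW cases in the paragraph above can be modelled with a toy reference-counting sketch. It only illustrates the counting rule; `struct demo_block` and `demo_cow_block()` are hypothetical, and freeing a block is represented by its count dropping to zero.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_PTRS 4

/* Hypothetical tree block: a reference count plus pointers to lower extents. */
struct demo_block {
	int refs;
	size_t nr_ptrs;
	struct demo_block *ptrs[DEMO_MAX_PTRS];
};

/* COW 'old' into the caller-provided 'new_blk'. */
static void demo_cow_block(struct demo_block *old, struct demo_block *new_blk)
{
	size_t i;

	assert(old->refs >= 1);
	assert(old->nr_ptrs <= DEMO_MAX_PTRS);

	/* The new block starts as a copy referenced once. */
	new_blk->refs = 1;
	new_blk->nr_ptrs = old->nr_ptrs;
	for (i = 0; i < old->nr_ptrs; i++)
		new_blk->ptrs[i] = old->ptrs[i];

	if (old->refs == 1) {
		/*
		 * Non-shared block: free the old block at once. The new block
		 * inherits its references, so lower-level counts do not change.
		 */
		old->refs = 0;
		old->nr_ptrs = 0;
	} else {
		/*
		 * Shared block: everything the new block points to gains one
		 * reference, and the old block loses one.
		 */
		for (i = 0; i < new_blk->nr_ptrs; i++)
			new_blk->ptrs[i]->refs++;
		old->refs--;
	}
}
```

In the non-shared case no lower-level count is touched at all, which is the saving the commit message describes.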
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
/* relocation.c */
2016-06-21 21:16:51 -04:00
int btrfs_relocate_block_group ( struct btrfs_fs_info * fs_info , u64 group_start ) ;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
int btrfs_init_reloc_root ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ) ;
int btrfs_update_reloc_root ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root ) ;
int btrfs_recover_relocation ( struct btrfs_root * root ) ;
2020-06-03 08:55:04 +03:00
int btrfs_reloc_clone_csums ( struct btrfs_inode * inode , u64 file_pos , u64 len ) ;
2013-08-30 15:09:51 -04:00
int btrfs_reloc_cow_block ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root , struct extent_buffer * buf ,
struct extent_buffer * cow ) ;
2015-08-06 20:58:11 +08:00
void btrfs_reloc_pre_snapshot ( struct btrfs_pending_snapshot * pending ,
2010-05-16 10:49:59 -04:00
u64 * bytes_to_reserve ) ;
2012-03-01 17:24:58 +01:00
int btrfs_reloc_post_snapshot ( struct btrfs_trans_handle * trans ,
2010-05-16 10:49:59 -04:00
struct btrfs_pending_snapshot * pending ) ;
2020-02-17 14:16:52 +08:00
int btrfs_should_cancel_balance ( struct btrfs_fs_info * fs_info ) ;
2020-03-06 14:04:12 +08:00
struct btrfs_root * find_reloc_root ( struct btrfs_fs_info * fs_info ,
u64 bytenr ) ;
2020-03-03 14:26:02 +08:00
int btrfs_should_ignore_reloc_root ( struct btrfs_root * root ) ;
2011-03-08 14:14:00 +01:00
/* scrub.c */
2012-11-05 17:03:39 +01:00
int btrfs_scrub_dev ( struct btrfs_fs_info * fs_info , u64 devid , u64 start ,
u64 end , struct btrfs_scrub_progress * progress ,
2012-11-05 18:29:28 +01:00
int readonly , int is_dev_replace ) ;
2016-06-22 18:54:24 -04:00
void btrfs_scrub_pause ( struct btrfs_fs_info * fs_info ) ;
void btrfs_scrub_continue ( struct btrfs_fs_info * fs_info ) ;
2012-11-05 17:03:39 +01:00
int btrfs_scrub_cancel ( struct btrfs_fs_info * info ) ;
2019-03-20 16:32:55 +01:00
int btrfs_scrub_cancel_dev ( struct btrfs_device * dev ) ;
2016-06-22 18:54:24 -04:00
int btrfs_scrub_progress ( struct btrfs_fs_info * fs_info , u64 devid ,
2011-03-08 14:14:00 +01:00
struct btrfs_scrub_progress * progress ) ;
2017-04-14 08:35:54 +08:00
static inline void btrfs_init_full_stripe_locks_tree (
struct btrfs_full_stripe_locks_tree * locks_root )
{
locks_root->root = RB_ROOT ;
mutex_init ( &locks_root->lock ) ;
}
Btrfs: fix use-after-free in the finishing procedure of the device replace
During a device replace test, we hit a null pointer dereference (it was very
easy to reproduce by running xfstests' btrfs/011 on devices with the virtio
scsi driver). There were two bugs that caused this problem:
- We might allocate new chunks on the replaced device after we updated
the mapping tree, and we forgot to replace the source device in the
mappings of those new chunks.
- We might get mapping information that includes the source device
before the mapping information is updated, and then submit a bio
based on that mapping information after we freed the source device.
For the first bug, we can fix it by doing the mapping tree update and the
source device removal in the same chunk mutex context. The chunk mutex
protects the allocable device list, so this prevents new chunk allocation
in between, and after we remove the source device, all new chunks will
be allocated on the new device.
For the second bug, we need to make sure all in-flight bios are finished and
no new bios are produced while we are removing the source device. To fix
this problem, we introduced a global @bio_counter: we not only inc/dec
@bio_counter outside of map_blocks, but also inc it before submitting a bio
and dec @bio_counter when ending bios.
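A minimal user-space sketch of the counter idea, assuming a condition variable stands in for the kernel wait queue. The class and method names below are hypothetical and only loosely mirror the btrfs_bio_counter_* helpers; it is not the kernel implementation.

```python
import threading


class BioCounter:
    """Toy model of a global in-flight bio counter (hypothetical names)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._count = 0
        self._blocked = False  # set while the source device is being removed

    def inc_blocked(self):
        """Account a new bio, waiting while new bios are blocked."""
        with self._cond:
            while self._blocked:
                self._cond.wait()
            self._count += 1

    def dec(self):
        """Called when a bio ends; wakes waiters once nothing is in flight."""
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def block_and_drain(self, timeout=None):
        """Stop new bios, then wait for in-flight ones. True if drained."""
        with self._cond:
            self._blocked = True
            return self._cond.wait_for(lambda: self._count == 0, timeout)

    def unblock(self):
        """Allow new bios again, e.g. after the source device is gone."""
        with self._cond:
            self._blocked = False
            self._cond.notify_all()
```

The remover would call block_and_drain(), swap the device, then unblock(); every submitter brackets its bio with inc_blocked() and dec().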
Since Raid56 is a little different and device replace doesn't support raid56
yet, it is not addressed in this patch, and I added comments to make sure we
will fix it in the future.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-01-30 16:46:55 +08:00
/* dev-replace.c */
void btrfs_bio_counter_inc_blocked ( struct btrfs_fs_info * fs_info ) ;
void btrfs_bio_counter_inc_noblocked ( struct btrfs_fs_info * fs_info ) ;
2014-11-25 16:39:28 +08:00
void btrfs_bio_counter_sub ( struct btrfs_fs_info * fs_info , s64 amount ) ;
static inline void btrfs_bio_counter_dec ( struct btrfs_fs_info * fs_info )
{
btrfs_bio_counter_sub ( fs_info , 1 ) ;
}
2011-03-08 14:14:00 +01:00
btrfs: initial readahead code and prototypes
This is the implementation for the generic read ahead framework.
To trigger a readahead, btrfs_reada_add must be called. It will start
a read ahead for the given range [start, end) on tree root. The returned
handle can either be used to wait on the readahead to finish
(btrfs_reada_wait), or to send it to the background (btrfs_reada_detach).
The read ahead works as follows:
On btrfs_reada_add, the root of the tree is inserted into a radix_tree.
reada_start_machine will then search for extents to prefetch and trigger
some reads. When a read finishes for a node, all contained node/leaf
pointers that lie in the given range will also be enqueued. The reads will
be triggered in sequential order, thus giving a big win over a naive
enumeration. It will also make use of multi-device layouts. Each disk
will have its own read pointer and all disks will be utilized in parallel.
Also, no two disks will read both sides of a mirror simultaneously, as this
would waste seeking capacity. Instead both disks will read different parts
of the filesystem.
Any number of readaheads can be started in parallel. The read order will be
determined globally, i.e. 2 parallel readaheads will normally finish faster
than the 2 started one after another.
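The enqueue-children-in-range behaviour described above can be sketched as follows. This is an assumption-laden toy: `readahead_order` and the dict-based tree are hypothetical, a min-heap on the logical address stands in for "sequential order", and devices and mirrors are not modelled.

```python
import heapq


def readahead_order(tree, root, start, end):
    """Return the order in which blocks would be read for [start, end).

    `tree` maps a block's logical address to a list of
    (child_logical, lo_key, hi_key) pointers; leaves map to [].
    """
    heap = [root]
    seen = {root}
    order = []
    while heap:
        # Always read the pending block with the lowest logical address.
        logical = heapq.heappop(heap)
        order.append(logical)
        for child, lo, hi in tree.get(logical, []):
            # Enqueue only children whose key range overlaps [start, end).
            if lo < end and hi > start and child not in seen:
                seen.add(child)
                heapq.heappush(heap, child)
    return order
```

For example, with a root at 100 pointing at children 300, 200 and 400, a request that overlaps only the first two reads 100, then 200 and its in-range children, then 300.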
Changes v2:
- protect root->node by transaction instead of node_lock
- fix missed branches:
The readahead had a too simple check to determine if a branch from
a node should be checked or not. It now also records the upper bound
of each node to see if the requested RA range lies within.
- use KERN_CONT to debug output, to avoid line breaks
- defer reada_start_machine to worker to avoid deadlock
Changes v3:
- protect root->node by rcu
Changes v5:
- changed EIO-semantics of reada_tree_block_flagged
- remove spin_lock from reada_control and make elems an atomic_t
- remove unused read_total from reada_control
- kill reada_key_cmp, use btrfs_comp_cpu_keys instead
- use kref-style release functions where possible
- return struct reada_control * instead of void * from btrfs_reada_add
Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-05-23 14:33:49 +02:00
/* reada.c */
struct reada_control {
2016-06-22 18:56:44 -04:00
struct btrfs_fs_info * fs_info ; /* tree to prefetch */
struct btrfs_key key_start ;
struct btrfs_key key_end ; /* exclusive */
atomic_t elems ;
struct kref refcnt ;
wait_queue_head_t wait ;
} ;
struct reada_control * btrfs_reada_add ( struct btrfs_root * root ,
struct btrfs_key * start , struct btrfs_key * end ) ;
int btrfs_reada_wait ( void * handle ) ;
void btrfs_reada_detach ( void * handle ) ;
2017-03-02 19:43:30 +01:00
int btree_readahead_hook ( struct extent_buffer * eb , int err ) ;
btrfs: fix readahead hang and use-after-free after removing a device
Very sporadically I had test case btrfs/069 from fstests hanging (for
years, it is not a recent regression), with the following traces in
dmesg/syslog:
[162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
[162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
[162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
[162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
[162513.514318] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.514747] task:btrfs-transacti state:D stack: 0 pid:1356167 ppid: 2 flags:0x00004000
[162513.514751] Call Trace:
[162513.514761] __schedule+0x5ce/0xd00
[162513.514765] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.514771] schedule+0x46/0xf0
[162513.514844] wait_current_trans+0xde/0x140 [btrfs]
[162513.514850] ? finish_wait+0x90/0x90
[162513.514864] start_transaction+0x37c/0x5f0 [btrfs]
[162513.514879] transaction_kthread+0xa4/0x170 [btrfs]
[162513.514891] ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
[162513.514894] kthread+0x153/0x170
[162513.514897] ? kthread_stop+0x2c0/0x2c0
[162513.514902] ret_from_fork+0x22/0x30
[162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
[162513.515192] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.515680] task:fsstress state:D stack: 0 pid:1356184 ppid:1356177 flags:0x00004000
[162513.515682] Call Trace:
[162513.515688] __schedule+0x5ce/0xd00
[162513.515691] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.515697] schedule+0x46/0xf0
[162513.515712] wait_current_trans+0xde/0x140 [btrfs]
[162513.515716] ? finish_wait+0x90/0x90
[162513.515729] start_transaction+0x37c/0x5f0 [btrfs]
[162513.515743] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
[162513.515753] btrfs_sync_fs+0x61/0x1c0 [btrfs]
[162513.515758] ? __ia32_sys_fdatasync+0x20/0x20
[162513.515761] iterate_supers+0x87/0xf0
[162513.515765] ksys_sync+0x60/0xb0
[162513.515768] __do_sys_sync+0xa/0x10
[162513.515771] do_syscall_64+0x33/0x80
[162513.515774] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.515781] RIP: 0033:0x7f5238f50bd7
[162513.515782] Code: Bad RIP value.
[162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
[162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
[162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
[162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
[162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
[162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
[162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
[162513.516064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.516617] task:fsstress state:D stack: 0 pid:1356185 ppid:1356177 flags:0x00000000
[162513.516620] Call Trace:
[162513.516625] __schedule+0x5ce/0xd00
[162513.516628] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.516634] schedule+0x46/0xf0
[162513.516647] wait_current_trans+0xde/0x140 [btrfs]
[162513.516650] ? finish_wait+0x90/0x90
[162513.516662] start_transaction+0x4d7/0x5f0 [btrfs]
[162513.516679] btrfs_setxattr_trans+0x3c/0x100 [btrfs]
[162513.516686] __vfs_setxattr+0x66/0x80
[162513.516691] __vfs_setxattr_noperm+0x70/0x200
[162513.516697] vfs_setxattr+0x6b/0x120
[162513.516703] setxattr+0x125/0x240
[162513.516709] ? lock_acquire+0xb1/0x480
[162513.516712] ? mnt_want_write+0x20/0x50
[162513.516721] ? rcu_read_lock_any_held+0x8e/0xb0
[162513.516723] ? preempt_count_add+0x49/0xa0
[162513.516725] ? __sb_start_write+0x19b/0x290
[162513.516727] ? preempt_count_add+0x49/0xa0
[162513.516732] path_setxattr+0xba/0xd0
[162513.516739] __x64_sys_setxattr+0x27/0x30
[162513.516741] do_syscall_64+0x33/0x80
[162513.516743] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.516745] RIP: 0033:0x7f5238f56d5a
[162513.516746] Code: Bad RIP value.
[162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
[162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
[162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
[162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
[162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
[162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
[162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
[162513.517064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.517763] task:fsstress state:D stack: 0 pid:1356196 ppid:1356177 flags:0x00004000
[162513.517780] Call Trace:
[162513.517786] __schedule+0x5ce/0xd00
[162513.517789] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.517796] schedule+0x46/0xf0
[162513.517810] wait_current_trans+0xde/0x140 [btrfs]
[162513.517814] ? finish_wait+0x90/0x90
[162513.517829] start_transaction+0x37c/0x5f0 [btrfs]
[162513.517845] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
[162513.517857] btrfs_sync_fs+0x61/0x1c0 [btrfs]
[162513.517862] ? __ia32_sys_fdatasync+0x20/0x20
[162513.517865] iterate_supers+0x87/0xf0
[162513.517869] ksys_sync+0x60/0xb0
[162513.517872] __do_sys_sync+0xa/0x10
[162513.517875] do_syscall_64+0x33/0x80
[162513.517878] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.517881] RIP: 0033:0x7f5238f50bd7
[162513.517883] Code: Bad RIP value.
[162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
[162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
[162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
[162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
[162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
[162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
[162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
[162513.518298] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.519157] task:fsstress state:D stack: 0 pid:1356197 ppid:1356177 flags:0x00000000
[162513.519160] Call Trace:
[162513.519165] __schedule+0x5ce/0xd00
[162513.519168] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.519174] schedule+0x46/0xf0
[162513.519190] wait_current_trans+0xde/0x140 [btrfs]
[162513.519193] ? finish_wait+0x90/0x90
[162513.519206] start_transaction+0x4d7/0x5f0 [btrfs]
[162513.519222] btrfs_create+0x57/0x200 [btrfs]
[162513.519230] lookup_open+0x522/0x650
[162513.519246] path_openat+0x2b8/0xa50
[162513.519270] do_filp_open+0x91/0x100
[162513.519275] ? find_held_lock+0x32/0x90
[162513.519280] ? lock_acquired+0x33b/0x470
[162513.519285] ? do_raw_spin_unlock+0x4b/0xc0
[162513.519287] ? _raw_spin_unlock+0x29/0x40
[162513.519295] do_sys_openat2+0x20d/0x2d0
[162513.519300] do_sys_open+0x44/0x80
[162513.519304] do_syscall_64+0x33/0x80
[162513.519307] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.519309] RIP: 0033:0x7f5238f4a903
[162513.519310] Code: Bad RIP value.
[162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
[162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
[162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
[162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
[162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
[162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
[162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
[162513.519727] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.520508] task:btrfs state:D stack: 0 pid:1356211 ppid:1356178 flags:0x00004002
[162513.520511] Call Trace:
[162513.520516] __schedule+0x5ce/0xd00
[162513.520519] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.520525] schedule+0x46/0xf0
[162513.520544] btrfs_scrub_pause+0x11f/0x180 [btrfs]
[162513.520548] ? finish_wait+0x90/0x90
[162513.520562] btrfs_commit_transaction+0x45a/0xc30 [btrfs]
[162513.520574] ? start_transaction+0xe0/0x5f0 [btrfs]
[162513.520596] btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
[162513.520619] btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
[162513.520639] btrfs_ioctl+0x2a25/0x36f0 [btrfs]
[162513.520643] ? do_sigaction+0xf3/0x240
[162513.520645] ? find_held_lock+0x32/0x90
[162513.520648] ? do_sigaction+0xf3/0x240
[162513.520651] ? lock_acquired+0x33b/0x470
[162513.520655] ? _raw_spin_unlock_irq+0x24/0x50
[162513.520657] ? lockdep_hardirqs_on+0x7d/0x100
[162513.520660] ? _raw_spin_unlock_irq+0x35/0x50
[162513.520662] ? do_sigaction+0xf3/0x240
[162513.520671] ? __x64_sys_ioctl+0x83/0xb0
[162513.520672] __x64_sys_ioctl+0x83/0xb0
[162513.520677] do_syscall_64+0x33/0x80
[162513.520679] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.520681] RIP: 0033:0x7fc3cd307d87
[162513.520682] Code: Bad RIP value.
[162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
[162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
[162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
[162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
[162513.520703]
Showing all locks held in the system:
[162513.520712] 1 lock held by khungtaskd/54:
[162513.520713] #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
[162513.520728] 1 lock held by in:imklog/596:
[162513.520729] #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
[162513.520782] 1 lock held by btrfs-transacti/1356167:
[162513.520784] #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
[162513.520798] 1 lock held by btrfs/1356190:
[162513.520800] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
[162513.520805] 1 lock held by fsstress/1356184:
[162513.520806] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
[162513.520811] 3 locks held by fsstress/1356185:
[162513.520812] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
[162513.520815] #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
[162513.520820] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
[162513.520833] 1 lock held by fsstress/1356196:
[162513.520834] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
[162513.520838] 3 locks held by fsstress/1356197:
[162513.520839] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
[162513.520843] #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
[162513.520846] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
[162513.520858] 2 locks held by btrfs/1356211:
[162513.520859] #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
[162513.520877] #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
This was weird because the stack traces show that a transaction commit,
triggered by a device replace operation, is blocking trying to pause any
running scrubs but there are no stack traces of blocked tasks doing a
scrub.
After poking around with drgn, I noticed there was a scrub task that was
constantly running and blocking for short periods of time:
>>> t = find_task(prog, 1356190)
>>> prog.stack_trace(t)
#0 __schedule+0x5ce/0xcfc
#1 schedule+0x46/0xe4
#2 schedule_timeout+0x1df/0x475
#3 btrfs_reada_wait+0xda/0x132
#4 scrub_stripe+0x2a8/0x112f
#5 scrub_chunk+0xcd/0x134
#6 scrub_enumerate_chunks+0x29e/0x5ee
#7 btrfs_scrub_dev+0x2d5/0x91b
#8 btrfs_ioctl+0x7f5/0x36e7
#9 __x64_sys_ioctl+0x83/0xb0
#10 do_syscall_64+0x33/0x77
#11 entry_SYSCALL_64+0x7c/0x156
Which corresponds to:
int btrfs_reada_wait(void *handle)
{
        struct reada_control *rc = handle;
        struct btrfs_fs_info *fs_info = rc->fs_info;
        while (atomic_read(&rc->elems)) {
                if (!atomic_read(&fs_info->reada_works_cnt))
                        reada_start_machine(fs_info);
                wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
                                   (HZ + 9) / 10);
        }
(...)
So the counter "rc->elems" was set to 1 and never decreased to 0, causing
the scrub task to loop forever in that function. Then I used the following
script for drgn to check the readahead requests:
$ cat dump_reada.py
import sys
import drgn
from drgn import NULL, Object, cast, container_of, execscript, \
    reinterpret, sizeof
from drgn.helpers.linux import *
mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
mnt = None
for mnt in for_each_mount(prog, dst = mnt_path):
    pass
if mnt is None:
    sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
    sys.exit(1)
fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
def dump_re(re):
    nzones = re.nzones.value_()
    print(f're at {hex(re.value_())}')
    print(f'\t logical {re.logical.value_()}')
    print(f'\t refcnt {re.refcnt.value_()}')
    print(f'\t nzones {nzones}')
    for i in range(nzones):
        dev = re.zones[i].device
        name = dev.name.str.string_()
        print(f'\t\t dev id {dev.devid.value_()} name {name}')
    print()
for _, e in radix_tree_for_each(fs_info.reada_tree):
    re = cast('struct reada_extent *', e)
    dump_re(re)
$ drgn dump_reada.py
re at 0xffff8f3da9d25ad8
logical 38928384
refcnt 1
nzones 1
dev id 0 name b'/dev/sdd'
$
So there was one readahead extent with a single zone corresponding to the
source device of that last device replace operation logged in dmesg/syslog.
Also the ID of that zone's device was 0 which is a special value set in
the source device of a device replace operation when the operation finishes
(constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
confirming again that device /dev/sdd was the source of a device replace
operation.
Normally there should be as many zones in the readahead extent as there are
devices, and I wasn't expecting the extent to be in a block group with a
'single' profile, so I went and confirmed with the following drgn script
that there weren't any single profile block groups:
$ cat dump_block_groups.py
import sys
import drgn
from drgn import NULL, Object, cast, container_of, execscript, \
    reinterpret, sizeof
from drgn.helpers.linux import *
mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
mnt = None
for mnt in for_each_mount(prog, dst = mnt_path):
    pass
if mnt is None:
    sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
    sys.exit(1)
fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
BTRFS_BLOCK_GROUP_DATA = (1 << 0)
BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
BTRFS_BLOCK_GROUP_DUP = (1 << 5)
BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)
def bg_flags_string(bg):
    flags = bg.flags.value_()
    ret = ''
    if flags & BTRFS_BLOCK_GROUP_DATA:
        ret = 'data'
    if flags & BTRFS_BLOCK_GROUP_METADATA:
        if len(ret) > 0:
            ret += '|'
        ret += 'meta'
    if flags & BTRFS_BLOCK_GROUP_SYSTEM:
        if len(ret) > 0:
            ret += '|'
        ret += 'system'
    if flags & BTRFS_BLOCK_GROUP_RAID0:
        ret += ' raid0'
    elif flags & BTRFS_BLOCK_GROUP_RAID1:
        ret += ' raid1'
    elif flags & BTRFS_BLOCK_GROUP_DUP:
        ret += ' dup'
    elif flags & BTRFS_BLOCK_GROUP_RAID10:
        ret += ' raid10'
    elif flags & BTRFS_BLOCK_GROUP_RAID5:
        ret += ' raid5'
    elif flags & BTRFS_BLOCK_GROUP_RAID6:
        ret += ' raid6'
    elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
        ret += ' raid1c3'
    elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
        ret += ' raid1c4'
    else:
        ret += ' single'
    return ret
def dump_bg(bg):
    print()
    print(f'block group at {hex(bg.value_())}')
    print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
    print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')
bg_root = fs_info.block_group_cache_tree.address_of_()
for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
    dump_bg(bg)
$ drgn dump_block_groups.py
block group at 0xffff8f3d673b0400
start 22020096 length 16777216
flags 258 - system raid6
block group at 0xffff8f3d53ddb400
start 38797312 length 536870912
flags 260 - meta raid6
block group at 0xffff8f3d5f4d9c00
start 575668224 length 2147483648
flags 257 - data raid6
block group at 0xffff8f3d08189000
start 2723151872 length 67108864
flags 258 - system raid6
block group at 0xffff8f3db70ff000
start 2790260736 length 1073741824
flags 260 - meta raid6
block group at 0xffff8f3d5f4dd800
start 3864002560 length 67108864
flags 258 - system raid6
block group at 0xffff8f3d67037000
start 3931111424 length 2147483648
flags 257 - data raid6
$
So there were only 2 reasons left for having a readahead extent with a
single zone: reada_find_zone(), called when creating a readahead extent,
returned NULL either because we failed to find the corresponding block
group or because a memory allocation failed. With some additional and
custom tracing I figured out that on every further occurrence of the
problem the block group had just been deleted when we were looping to
create the zones for the readahead extent (at reada_find_extent()), so we
ended up with only one zone in the readahead extent, corresponding to a
device that ends up getting replaced.
So after figuring that out it became obvious why the hang happens:
1) Task A starts a scrub on any device of the filesystem, except for
device /dev/sdd;
2) Task B starts a device replace with /dev/sdd as the source device;
3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
starting to scrub a stripe from block group X. This call to
btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
calls reada_add_block(), it passes the logical address of the extent
tree's root node as its 'logical' argument - a value of 38928384;
4) Task A then enters reada_find_extent(), called from reada_add_block().
It finds there isn't any existing readahead extent for the logical
address 38928384, so it proceeds to the path of creating a new one.
It calls btrfs_map_block() to find out which stripes exist for the block
group X. On the first iteration of the for loop that iterates over the
stripes, it finds the stripe for device /dev/sdd, so it creates one
zone for that device and adds it to the readahead extent. Before getting
into the second iteration of the loop, the cleanup kthread deletes block
group X because it was empty. So in the iterations for the remaining
stripes it does not add more zones to the readahead extent, because the
calls to reada_find_zone() returned NULL because they couldn't find
block group X anymore.
As a result the new readahead extent has a single zone, corresponding to
the device /dev/sdd;
5) Before task A returns to btrfs_reada_add() and queues the readahead job
for the readahead work queue, task B finishes the device replace and at
btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
device /dev/sdg;
6) Task A returns to reada_add_block(), which increments the counter
"->elems" of the reada_control structure allocated at btrfs_reada_add().
Then it returns back to btrfs_reada_add() and calls
reada_start_machine(). This queues a job in the readahead work queue to
run the function reada_start_machine_worker(), which calls
__reada_start_machine().
At __reada_start_machine() we take the device list mutex and for each
device found in the current device list, we call
reada_start_machine_dev() to start the readahead work. However at this
point the device /dev/sdd was already freed and is not in the device
list anymore.
This means the corresponding readahead for the extent at 38928384 is
never started, and therefore the "->elems" counter of the reada_control
structure allocated at btrfs_reada_add() never goes down to 0, causing
the call to btrfs_reada_wait(), done by the scrub task, to wait forever.
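The sequence above can be reproduced in a small model. This is a sketch under stated assumptions: `run_readahead` is a hypothetical helper, device names are plain strings, and the loop is bounded by `max_rounds` so the model terminates where the real wait would spin forever.

```python
def run_readahead(extents, device_list, max_rounds=10):
    """Return how many extents are still pending after `max_rounds` passes.

    `extents` maps a logical address to the devices its zones point at.
    Each pass reads every pending extent that has a zone on a device
    still present in `device_list`, mirroring the per-device loop.
    """
    pending = dict(extents)  # its size plays the role of "->elems"
    for _ in range(max_rounds):
        if not pending:
            break
        for dev in device_list:
            ready = [l for l, zones in pending.items() if dev in zones]
            for logical in ready:
                del pending[logical]
    return len(pending)
```

With a single zone on the replaced device, `run_readahead({38928384: ["sdd"]}, ["sde", "sdg"])` never drains, while an extent that also has a zone on /dev/sde completes on the first pass.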
Note that the readahead request can be made either after the device replace
started or before it started, however in practice it is very unlikely that a
device replace is able to start after a readahead request is made and is
able to complete before the readahead request completes - maybe only on a
very small and nearly empty filesystem.
This hang however is not the only problem we can have with readahead and
device removals. When the readahead extent has zones other than the
one corresponding to the device that is being removed (either by a device
replace or a device remove operation), we risk having a use-after-free on
the device when dropping the last reference of the readahead extent.
For example if we create a readahead extent with two zones, one for the
device /dev/sdd and one for the device /dev/sde:
1) Before the readahead worker starts, the device /dev/sdd is removed,
and the corresponding btrfs_device structure is freed. However the
readahead extent still has the zone pointing to the device structure;
2) When the readahead worker starts, it only finds device /dev/sde in the
current device list of the filesystem;
3) It starts the readahead work, at reada_start_machine_dev(), using the
device /dev/sde;
4) Then when it finishes reading the extent from device /dev/sde, it calls
__readahead_hook() which ends up dropping the last reference on the
readahead extent through the last call to reada_extent_put();
5) At reada_extent_put() it iterates over each zone of the readahead extent
and attempts to delete an element from the device's 'reada_extents'
radix tree, resulting in a use-after-free, as the device pointer of the
zone for /dev/sdd is now stale. We can also access the device after
dropping the last reference of a zone, through reada_zone_release(),
also called by reada_extent_put().
A device remove suffers from the same problem. However, since it shrinks the
device size down to zero before removing the device, it is very unlikely to
still have uncompleted readahead requests by the time we free the device;
the only possibility is if the device has very little space allocated.
While the hang problem is exclusive to scrub, since it is currently the
only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
problem affects any path that triggers readahead, which includes
btree_readahead_hook() and __readahead_hook() (a readahead worker can
trigger readahead for the children of a node) for example - any path that
ends up calling reada_add_block() can trigger the use-after-free after a
device is removed.
So fix this by waiting for any readahead requests for a device to complete
before removing a device, ensuring that while waiting for existing ones no
new ones can be made.
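The ordering behind the fix (first block new requests for the device, then
drain the existing ones, and only then free the device) can be sketched in
userspace as follows. This is a single-threaded illustration under stated
assumptions, not the kernel implementation: struct sim_dev, its fields and
all sim_* functions are hypothetical, and the drain loop merely stands in
for waiting on the readahead work queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device state, not the kernel's struct btrfs_device. */
struct sim_dev {
	bool no_reada;	/* set while the device is being removed */
	int pending;	/* readahead requests queued but not yet completed */
};

/* Returns true if the request was accepted for this device. */
static bool sim_reada_add(struct sim_dev *dev)
{
	if (dev->no_reada)
		return false;	/* device is going away, refuse new work */
	dev->pending++;
	return true;
}

static void sim_reada_complete_one(struct sim_dev *dev)
{
	assert(dev->pending > 0);
	dev->pending--;
}

/* Call before freeing the device. */
static void sim_reada_remove_dev(struct sim_dev *dev)
{
	/* 1) Block new requests first, so the drain below can terminate. */
	dev->no_reada = true;
	/* 2) Drain what is already queued (stand-in for a work queue flush). */
	while (dev->pending > 0)
		sim_reada_complete_one(dev);
}

/* Call if the removal fails and the device stays in the filesystem. */
static void sim_reada_undo_remove_dev(struct sim_dev *dev)
{
	dev->no_reada = false;
}
```

Setting the flag before draining is the important design choice: doing it
the other way around would leave a window in which a new request could be
queued after the drain and still reference the device once it is freed.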
This problem has been around for a very long time - the readahead code was
added in 2011, device remove exists since 2008 and device replace was
introduced in 2013, so it is hard to pick a specific commit for a git Fixes
tag.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-12 11:55:24 +01:00
void btrfs_reada_remove_dev ( struct btrfs_device * dev ) ;
void btrfs_reada_undo_remove_dev ( struct btrfs_device * dev ) ;
btrfs: initial readahead code and prototypes
This is the implementation for the generic read ahead framework.
To trigger a readahead, btrfs_reada_add must be called. It will start
a read ahead for the given range [start, end) on tree root. The returned
handle can either be used to wait on the readahead to finish
(btrfs_reada_wait), or to send it to the background (btrfs_reada_detach).
The read ahead works as follows:
On btrfs_reada_add, the root of the tree is inserted into a radix_tree.
reada_start_machine will then search for extents to prefetch and trigger
some reads. When a read finishes for a node, all contained node/leaf
pointers that lie in the given range will also be enqueued. The reads will
be triggered in sequential order, thus giving a big win over a naive
enumeration. It will also make use of multi-device layouts. Each disk
will have its own read pointer and all disks will be utilized in parallel.
Also, no two disks will read both sides of a mirror simultaneously, as this
would waste seeking capacity. Instead both disks will read different parts
of the filesystem.
Any number of readaheads can be started in parallel. The read order will be
determined globally, i.e. 2 parallel readaheads will normally finish faster
than the 2 started one after another.
Changes v2:
- protect root->node by transaction instead of node_lock
- fix missed branches:
The readahead had a too simple check to determine if a branch from
a node should be checked or not. It now also records the upper bound
of each node to see if the requested RA range lies within.
- use KERN_CONT to debug output, to avoid line breaks
- defer reada_start_machine to worker to avoid deadlock
Changes v3:
- protect root->node by rcu
Changes v5:
- changed EIO-semantics of reada_tree_block_flagged
- remove spin_lock from reada_control and make elems an atomic_t
- remove unused read_total from reada_control
- kill reada_key_cmp, use btrfs_comp_cpu_keys instead
- use kref-style release functions where possible
- return struct reada_control * instead of void * from btrfs_reada_add
Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-05-23 14:33:49 +02:00
2012-05-29 17:06:54 +02:00
static inline int is_fstree ( u64 rootid )
{
if ( rootid == BTRFS_FS_TREE_OBJECTID ||
2015-02-27 16:24:23 +08:00
( ( s64 ) rootid >= ( s64 ) BTRFS_FIRST_FREE_OBJECTID &&
! btrfs_qgroup_level ( rootid ) ) )
2012-05-29 17:06:54 +02:00
return 1 ;
return 0 ;
}
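The signed comparison in is_fstree() is easy to misread, so here is a
standalone userspace sketch of the same predicate. The constants (5 for the
FS tree, 256 for the first free objectid, a qgroup level shift of 48) are
assumptions copied in for illustration only, and every SIM_/sim_ name is
hypothetical rather than a kernel symbol.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, defined here only so the sketch is self-contained. */
#define SIM_FS_TREE_OBJECTID	5ULL
#define SIM_FIRST_FREE_OBJECTID	256ULL
#define SIM_QGROUP_LEVEL_SHIFT	48

/* Assumed layout: the qgroup level lives in the top 16 bits of the id. */
static uint16_t sim_qgroup_level(uint64_t id)
{
	return (uint16_t)(id >> SIM_QGROUP_LEVEL_SHIFT);
}

static int sim_is_fstree(uint64_t rootid)
{
	/*
	 * The casts to a signed type make ids stored as small negative
	 * numbers (for example (uint64_t)-8) compare below the first free
	 * objectid, so they are rejected. This relies on the usual
	 * two's complement conversion.
	 */
	if (rootid == SIM_FS_TREE_OBJECTID ||
	    ((int64_t)rootid >= (int64_t)SIM_FIRST_FREE_OBJECTID &&
	     !sim_qgroup_level(rootid)))
		return 1;
	return 0;
}

/* A few illustrative checks of the predicate. */
static void sim_is_fstree_examples(void)
{
	assert(sim_is_fstree(5) == 1);			/* the FS tree itself */
	assert(sim_is_fstree(256) == 1);		/* first subvolume id */
	assert(sim_is_fstree(255) == 0);		/* below the free range */
	assert(sim_is_fstree((uint64_t)-8) == 0);	/* "negative" objectid */
	assert(sim_is_fstree((1ULL << 48) | 256) == 0);	/* qgroup level 1 */
}
```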
2013-02-09 23:38:06 +00:00
static inline int btrfs_defrag_cancelled ( struct btrfs_fs_info * fs_info )
{
return signal_pending ( current ) ;
}
2019-03-27 14:24:15 +02:00
# define in_range(b, first, len) ((b) >= (first) && (b) < (first) + (len))
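As a quick illustration of the half-open semantics of in_range(), the
following userspace sketch treats [first, first + len) as the range, so the
end offset itself is excluded. The helper sim_offset_in_extent() is
hypothetical, and its assert is an illustrative guard only: the macro itself
does not check whether (first) + (len) wraps around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define in_range(b, first, len) ((b) >= (first) && (b) < (first) + (len))

/* Hypothetical helper: does byte offset 'b' fall inside the extent? */
static bool sim_offset_in_extent(uint64_t b, uint64_t start, uint64_t len)
{
	/* Guard added for the sketch: reject ranges that wrap past 2^64. */
	assert(start + len >= start);
	return in_range(b, start, len);
}
```

For an extent starting at 4096 with length 8192, offsets 4096 through 12287
are inside the range, while 4095 and 12288 are not; a zero-length range
contains nothing.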
2013-10-11 14:44:09 -04:00
/* Sanity test specific functions */
# ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
void btrfs_test_destroy_inode ( struct inode * inode ) ;
2016-06-21 09:52:41 -04:00
static inline int btrfs_is_testing ( struct btrfs_fs_info * fs_info )
2014-09-29 23:53:21 +02:00
{
2018-08-17 17:48:13 +02:00
return test_bit ( BTRFS_FS_STATE_DUMMY_FS_INFO , & fs_info->fs_state ) ;
}
# else
static inline int btrfs_is_testing ( struct btrfs_fs_info * fs_info )
{
2014-09-29 23:53:21 +02:00
return 0 ;
}
2018-08-17 17:48:13 +02:00
# endif
2018-04-03 19:16:55 +02:00
2020-11-10 20:26:08 +09:00
static inline bool btrfs_is_zoned ( const struct btrfs_fs_info * fs_info )
{
return fs_info->zoned != 0 ;
}
2007-02-02 09:18:22 -05:00
# endif