2018-04-03 19:23:33 +02:00
// SPDX-License-Identifier: GPL-2.0
2007-06-12 09:07:21 -04:00
/*
* Copyright ( C ) 2007 Oracle . All rights reserved .
*/
2007-03-22 12:13:20 -04:00
# include <linux/fs.h>
2007-03-28 13:57:48 -04:00
# include <linux/blkdev.h>
2022-07-15 13:59:21 +02:00
# include <linux/radix-tree.h>
2007-05-02 15:53:43 -04:00
# include <linux/writeback.h>
2008-04-09 16:28:12 -04:00
# include <linux/workqueue.h>
2008-06-25 16:01:31 -04:00
# include <linux/kthread.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
The percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers, which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
# include <linux/slab.h>
2010-11-21 22:20:49 -05:00
# include <linux/migrate.h>
2011-05-06 15:33:15 +02:00
# include <linux/ratelimit.h>
2013-04-19 15:08:05 +00:00
# include <linux/uuid.h>
2013-08-15 17:11:21 +02:00
# include <linux/semaphore.h>
2018-01-13 02:55:03 +09:00
# include <linux/error-injection.h>
btrfs: Remove custom crc32c init code
The custom crc32 init code was introduced in
14a958e678cd ("Btrfs: fix btrfs boot when compiled as built-in") to
enable using btrfs as a built-in. However, later as pointed out by
60efa5eb2e88 ("Btrfs: use late_initcall instead of module_init") this
wasn't enough and finally btrfs was switched to late_initcall which
comes after the generic crc32c implementation is initialised. The
latter commit superseded the former. Now that we don't have to
maintain our own code let's just remove it and switch to using the
generic implementation.
Despite touching a lot of files the patch is really simple. Here is the gist of
the changes:
1. Select LIBCRC32C rather than the low-level modules.
2. s/btrfs_crc32c/crc32c/g
3. replace hash.h with linux/crc32c.h
4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly.
I've tested this with btrfs being both a module and a built-in and xfstest
doesn't complain.
This does seem to fix the longstanding problem of not automatically selecting
the crc32c module when btrfs is used. Possibly there is a workaround in
dracut.
The modinfo confirms that now all the module dependencies are there:
before:
depends: zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
after:
depends: libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add more info to changelog from mails ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-08 11:45:05 +02:00
# include <linux/crc32c.h>
2018-12-13 21:16:45 +00:00
# include <linux/sched/mm.h>
2011-03-18 22:56:43 +00:00
# include <asm/unaligned.h>
2019-06-03 16:58:56 +02:00
# include <crypto/hash.h>
2007-02-02 09:18:22 -05:00
# include "ctree.h"
# include "disk-io.h"
2007-03-16 16:20:31 -04:00
# include "transaction.h"
2007-04-09 10:42:37 -04:00
# include "btrfs_inode.h"
2022-11-15 10:44:05 +01:00
# include "bio.h"
2007-10-15 16:15:53 -04:00
# include "print-tree.h"
2008-06-25 16:01:30 -04:00
# include "locking.h"
2008-09-05 16:13:11 -04:00
# include "tree-log.h"
2009-04-03 09:47:43 -04:00
# include "free-space-cache.h"
2015-09-29 20:50:38 -07:00
# include "free-space-tree.h"
2012-11-06 13:15:27 +01:00
# include "dev-replace.h"
2013-01-29 18:40:14 -05:00
# include "raid56.h"
2013-11-01 13:06:58 -04:00
# include "sysfs.h"
2014-05-13 17:30:47 -07:00
# include "qgroup.h"
2016-03-10 17:26:59 +08:00
# include "compression.h"
2017-10-09 01:51:02 +00:00
# include "tree-checker.h"
2017-09-29 15:43:50 -04:00
# include "ref-verify.h"
2019-06-20 15:37:44 -04:00
# include "block-group.h"
2019-12-13 16:22:14 -08:00
# include "discard.h"
2020-01-20 16:09:08 +02:00
# include "space-info.h"
2020-11-10 20:26:08 +09:00
# include "zoned.h"
2021-03-25 15:14:39 +08:00
# include "subpage.h"
2022-10-19 10:50:47 -04:00
# include "fs.h"
2022-10-19 10:51:00 -04:00
# include "accessors.h"
2022-10-24 14:46:57 -04:00
# include "extent-tree.h"
2022-10-24 14:47:00 -04:00
# include "root-tree.h"
2022-10-26 15:08:25 -04:00
# include "defrag.h"
2022-10-26 15:08:28 -04:00
# include "uuid-tree.h"
2022-10-26 15:08:34 -04:00
# include "relocation.h"
2022-10-26 15:08:35 -04:00
# include "scrub.h"
2022-10-26 15:08:39 -04:00
# include "super.h"
2007-02-02 09:18:22 -05:00
2015-12-15 09:14:36 +08:00
# define BTRFS_SUPER_FLAG_SUPP (BTRFS_HEADER_FLAG_WRITTEN |\
BTRFS_HEADER_FLAG_RELOC | \
BTRFS_SUPER_FLAG_ERROR | \
BTRFS_SUPER_FLAG_SEEDING | \
2018-01-09 09:05:41 +08:00
BTRFS_SUPER_FLAG_METADUMP | \
BTRFS_SUPER_FLAG_METADUMP_V2 )
2015-12-15 09:14:36 +08:00
2016-06-22 18:54:24 -04:00
static int btrfs_cleanup_transaction(struct btrfs_fs_info *fs_info);
static void btrfs_error_commit_super(struct btrfs_fs_info *fs_info);
2008-04-09 16:28:12 -04:00
2020-01-24 09:32:57 -05:00
static void btrfs_free_csum_hash(struct btrfs_fs_info *fs_info)
{
	if (fs_info->csum_shash)
		crypto_free_shash(fs_info->csum_shash);
}
2008-09-29 15:18:18 -04:00
/*
2019-02-25 14:24:15 +01:00
 * Compute the csum of a btree block and store the result to provided buffer.
2008-09-29 15:18:18 -04:00
 */
2020-02-27 21:00:49 +01:00
static void csum_tree_block(struct extent_buffer *buf, u8 *result)
2007-10-15 16:19:22 -04:00
{
2019-06-03 16:58:57 +02:00
	struct btrfs_fs_info *fs_info = buf->fs_info;
2023-11-16 15:49:06 +10:30
	int num_pages;
	u32 first_page_part;
2019-06-03 16:58:57 +02:00
	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
2007-10-15 16:19:22 -04:00
	char *kaddr;
2020-02-27 21:00:47 +01:00
	int i;
2019-06-03 16:58:57 +02:00
	shash->tfm = fs_info->csum_shash;
	crypto_shash_init(shash);
2023-11-16 15:49:06 +10:30
	if (buf->addr) {
		/* Pages are contiguous, handle them as a big one. */
		kaddr = buf->addr;
		first_page_part = fs_info->nodesize;
		num_pages = 1;
	} else {
2023-12-07 09:39:27 +10:30
		kaddr = folio_address(buf->folios[0]);
2023-11-16 15:49:06 +10:30
		first_page_part = min_t(u32, PAGE_SIZE, fs_info->nodesize);
		num_pages = num_extent_pages(buf);
	}
2020-02-27 21:00:47 +01:00
	crypto_shash_update(shash, kaddr + BTRFS_CSUM_SIZE,
2020-11-03 21:30:47 +08:00
			    first_page_part - BTRFS_CSUM_SIZE);
2007-10-15 16:19:22 -04:00
2023-12-07 09:39:28 +10:30
	/*
	 * Multiple single-page folios case would reach here.
	 *
	 * nodesize <= PAGE_SIZE and large folio all handled by above
	 * crypto_shash_update() already.
	 */
btrfs: fix csum_tree_block page iteration to avoid tripping on -Werror=array-bounds
When compiling on a MIPS 64-bit machine we get these warnings:
In file included from ./arch/mips/include/asm/cacheflush.h:13,
from ./include/linux/cacheflush.h:5,
from ./include/linux/highmem.h:8,
from ./include/linux/bvec.h:10,
from ./include/linux/blk_types.h:10,
from ./include/linux/blkdev.h:9,
from fs/btrfs/disk-io.c:7:
fs/btrfs/disk-io.c: In function ‘csum_tree_block’:
fs/btrfs/disk-io.c:100:34: error: array subscript 1 is above array bounds of ‘struct page *[1]’ [-Werror=array-bounds]
100 | kaddr = page_address(buf->pages[i]);
| ~~~~~~~~~~^~~
./include/linux/mm.h:2135:48: note: in definition of macro ‘page_address’
2135 | #define page_address(page) lowmem_page_address(page)
| ^~~~
cc1: all warnings being treated as errors
We can check if i overflows to solve the problem. However, this doesn't make
much sense, since i == 1 and num_pages == 1 doesn't execute the body of the loop.
In addition, i < num_pages can also ensure that buf->pages[i] will not cross
the boundary. Unfortunately, this doesn't help with the problem observed here:
gcc still complains.
To fix this add a compile-time condition for the extent buffer page
array size limit, which would eventually lead to eliminating the whole
for loop.
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: pengfuyuan <pengfuyuan@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-23 15:09:55 +08:00
	for (i = 1; i < num_pages && INLINE_EXTENT_BUFFER_PAGES > 1; i++) {
2023-12-07 09:39:27 +10:30
		kaddr = folio_address(buf->folios[i]);
2020-02-27 21:00:47 +01:00
		crypto_shash_update(shash, kaddr, PAGE_SIZE);
2007-10-15 16:19:22 -04:00
	}
2017-11-06 19:23:00 +01:00
	memset(result, 0, BTRFS_CSUM_SIZE);
2019-06-03 16:58:57 +02:00
	crypto_shash_final(shash, result);
2007-10-15 16:19:22 -04:00
}
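The hashing pattern in csum_tree_block() (skip the leading checksum field, hash the rest of the first page, then feed each remaining page) can be tried out in user space. The following is a minimal sketch, not the kernel code: a bitwise CRC32C stands in for the crypto shash API, and TOY_CSUM_SIZE, TOY_PAGE_SIZE, toy_crc32c_update(), toy_csum_paged() and toy_csum_contig() are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CSUM_SIZE 32u	/* hypothetical stand-in for BTRFS_CSUM_SIZE */
#define TOY_PAGE_SIZE 4096u	/* hypothetical stand-in for PAGE_SIZE */

/* Bitwise CRC32C (reflected polynomial 0x82F63B78), incremental form. */
static uint32_t toy_crc32c_update(uint32_t crc, const uint8_t *data, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Block stored as separate pages: skip the csum field, then walk the pages. */
static uint32_t toy_csum_paged(uint8_t *const pages[], size_t num_pages,
			       size_t nodesize)
{
	size_t first = nodesize < TOY_PAGE_SIZE ? nodesize : TOY_PAGE_SIZE;
	uint32_t crc = 0xFFFFFFFFu;

	crc = toy_crc32c_update(crc, pages[0] + TOY_CSUM_SIZE,
				first - TOY_CSUM_SIZE);
	for (size_t i = 1; i < num_pages; i++)
		crc = toy_crc32c_update(crc, pages[i], TOY_PAGE_SIZE);
	return ~crc;
}

/* Block mapped contiguously: one update covers everything after the csum. */
static uint32_t toy_csum_contig(const uint8_t *addr, size_t nodesize)
{
	return ~toy_crc32c_update(0xFFFFFFFFu, addr + TOY_CSUM_SIZE,
				  nodesize - TOY_CSUM_SIZE);
}
```

Both paths should yield the same value for the same bytes, and bytes inside the first TOY_CSUM_SIZE bytes never influence the result, which is why the checksum can later be stored there.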
2008-09-29 15:18:18 -04:00
/*
 * we can't consider a given block up to date unless the transid of the
 * block matches the transid in the parent node's pointer. This is how we
 * detect blocks that either didn't get written at all or got written
 * in the wrong place.
 */
2023-05-03 17:24:24 +02:00
int btrfs_buffer_uptodate(struct extent_buffer *eb, u64 parent_transid, int atomic)
2008-05-12 13:39:03 -04:00
{
2023-05-03 17:24:24 +02:00
	if (!extent_buffer_uptodate(eb))
2008-05-12 13:39:03 -04:00
		return 0;
2023-05-03 17:24:24 +02:00
	if (!parent_transid || btrfs_header_generation(eb) == parent_transid)
		return 1;
2012-05-06 07:23:47 -04:00
	if (atomic)
		return -EAGAIN;
2023-05-03 17:24:24 +02:00
	if (!extent_buffer_uptodate(eb) ||
	    btrfs_header_generation(eb) != parent_transid) {
		btrfs_err_rl(eb->fs_info,
2022-06-19 21:47:56 +08:00
"parent transid verify failed on logical %llu mirror %u wanted %llu found %llu",
			     eb->start, eb->read_mirror,
2014-07-04 17:59:06 +08:00
			     parent_transid, btrfs_header_generation(eb));
2023-05-03 17:24:24 +02:00
		clear_extent_buffer_uptodate(eb);
2023-05-03 17:24:39 +02:00
		return 0;
2023-05-03 17:24:24 +02:00
	}
2023-05-03 17:24:39 +02:00
	return 1;
2008-05-12 13:39:03 -04:00
}
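The return values of btrfs_buffer_uptodate() form a small decision table. A user-space sketch of that table follows, assuming a hypothetical struct toy_eb with only the two fields the decision depends on; toy_buffer_uptodate() is an illustrative name, and the second uptodate re-check and the error message of the real function are left out because nothing else can change the buffer in this single-threaded model.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduced extent buffer: just the state the check looks at. */
struct toy_eb {
	bool uptodate;
	uint64_t generation;
};

/*
 * Returns 1 if the buffer can be trusted, 0 if it must be re-read, and
 * -EAGAIN if the caller is in atomic context and a mismatch was seen.
 */
static int toy_buffer_uptodate(struct toy_eb *eb, uint64_t parent_transid,
			       bool atomic)
{
	if (!eb->uptodate)
		return 0;
	/* A parent_transid of 0 means the caller has nothing to compare. */
	if (!parent_transid || eb->generation == parent_transid)
		return 1;
	if (atomic)
		return -EAGAIN;
	/* Generation mismatch: drop the uptodate state so the block is re-read. */
	eb->uptodate = false;
	return 0;
}
```

In this model a mismatch in non-atomic context clears the uptodate flag, so a second call on the same buffer returns 0 at the first check.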
2019-06-03 16:58:53 +02:00
static bool btrfs_supported_super_csum(u16 csum_type)
{
	switch (csum_type) {
	case BTRFS_CSUM_TYPE_CRC32:
2019-10-07 11:11:01 +02:00
	case BTRFS_CSUM_TYPE_XXHASH:
2019-10-07 11:11:02 +02:00
	case BTRFS_CSUM_TYPE_SHA256:
2019-10-07 11:11:02 +02:00
	case BTRFS_CSUM_TYPE_BLAKE2:
2019-06-03 16:58:53 +02:00
		return true;
	default:
		return false;
	}
}
2013-03-06 15:57:46 +01:00
/*
 * Return 0 if the superblock checksum type matches the checksum value of that
 * algorithm. Pass the raw disk superblock data.
 */
2022-10-18 09:56:38 +08:00
int btrfs_check_super_csum(struct btrfs_fs_info *fs_info,
			   const struct btrfs_super_block *disk_sb)
2013-03-06 15:57:46 +01:00
{
2019-06-03 16:58:55 +02:00
	char result[BTRFS_CSUM_SIZE];
2019-06-03 16:58:57 +02:00
	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
	shash->tfm = fs_info->csum_shash;
2013-03-06 15:57:46 +01:00
2019-06-03 16:58:55 +02:00
	/*
	 * The super_block structure does not span the whole
	 * BTRFS_SUPER_INFO_SIZE range, we expect that the unused space is
	 * filled with zeros and is included in the checksum.
	 */
2022-10-18 09:56:38 +08:00
	crypto_shash_digest(shash, (const u8 *)disk_sb + BTRFS_CSUM_SIZE,
2020-04-30 23:51:59 -07:00
			    BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE, result);
2013-03-06 15:57:46 +01:00
2020-06-30 02:01:31 +02:00
	if (memcmp(disk_sb->csum, result, fs_info->csum_size))
2019-06-03 16:58:55 +02:00
		return 1;
2013-03-06 15:57:46 +01:00
2019-06-03 16:58:53 +02:00
	return 0;
2013-03-06 15:57:46 +01:00
}
2022-11-15 10:44:06 +01:00
static int btrfs_repair_eb_io_failure(const struct extent_buffer *eb,
				      int mirror_num)
{
	struct btrfs_fs_info *fs_info = eb->fs_info;
btrfs: migrate btrfs_repair_io_failure() to folio interfaces
[BUG]
Test case btrfs/124 failed if larger metadata folio is enabled, the
dying message looks like this:
BTRFS error (device dm-2): bad tree block start, mirror 2 want 31686656 have 0
BTRFS info (device dm-2): read error corrected: ino 0 off 31686656 (dev /dev/mapper/test-scratch2 sector 20928)
BUG: kernel NULL pointer dereference, address: 0000000000000020
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
CPU: 6 PID: 350881 Comm: btrfs Tainted: G OE 6.7.0-rc3-custom+ #128
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
RIP: 0010:btrfs_read_extent_buffer+0x106/0x180 [btrfs]
PKRU: 55555554
Call Trace:
<TASK>
read_tree_block+0x33/0xb0 [btrfs]
read_block_for_search+0x23e/0x340 [btrfs]
btrfs_search_slot+0x2f9/0xe60 [btrfs]
btrfs_lookup_csum+0x75/0x160 [btrfs]
btrfs_lookup_bio_sums+0x21a/0x560 [btrfs]
btrfs_submit_chunk+0x152/0x680 [btrfs]
btrfs_submit_bio+0x1c/0x50 [btrfs]
submit_one_bio+0x40/0x80 [btrfs]
submit_extent_page+0x158/0x390 [btrfs]
btrfs_do_readpage+0x330/0x740 [btrfs]
extent_readahead+0x38d/0x6c0 [btrfs]
read_pages+0x94/0x2c0
page_cache_ra_unbounded+0x12d/0x190
relocate_file_extent_cluster+0x7c1/0x9d0 [btrfs]
relocate_block_group+0x2d3/0x560 [btrfs]
btrfs_relocate_block_group+0x2c7/0x4b0 [btrfs]
btrfs_relocate_chunk+0x4c/0x1a0 [btrfs]
btrfs_balance+0x925/0x13c0 [btrfs]
btrfs_ioctl+0x19f1/0x25d0 [btrfs]
__x64_sys_ioctl+0x90/0xd0
do_syscall_64+0x3f/0xf0
entry_SYSCALL_64_after_hwframe+0x6e/0x76
[CAUSE]
The dying line is at btrfs_repair_io_failure() call inside
btrfs_repair_eb_io_failure().
The function is still relying on the extent buffer using page sized
folios.
When the extent buffer is using larger folio, we go into the 2nd slot of
folios[], and triggered the NULL pointer dereference.
[FIX]
Migrate btrfs_repair_io_failure() to folio interfaces.
So that when we hit a larger folio, we just submit the whole folio in
one go.
This also affects data repair path through btrfs_end_repair_bio(),
thankfully data is still fully page based, we can just add an
ASSERT(), and use page_folio() to convert the page to folio.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 15:54:10 +10:30
	int num_folios = num_extent_folios(eb);
2022-11-15 10:44:06 +01:00
	int ret = 0;
	if (sb_rdonly(fs_info->sb))
		return -EROFS;
2023-12-12 15:54:10 +10:30
	for (int i = 0; i < num_folios; i++) {
		struct folio *folio = eb->folios[i];
		u64 start = max_t(u64, eb->start, folio_pos(folio));
2023-12-07 09:39:27 +10:30
		u64 end = min_t(u64, eb->start + eb->len,
2024-01-05 16:05:55 +10:30
				folio_pos(folio) + eb->folio_size);
btrfs: subpage: fix a crash in metadata repair path
[BUG]
Test case btrfs/027 would crash with subpage (64K page size, 4K
sectorsize) with the following dying messages:
debug: map_length=16384 length=65536 type=metadata|raid6(0x104)
assertion failed: map_length >= length, in fs/btrfs/volumes.c:8093
------------[ cut here ]------------
kernel BUG at fs/btrfs/messages.c:259!
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
Call trace:
btrfs_assertfail+0x28/0x2c [btrfs]
btrfs_map_repair_block+0x150/0x2b8 [btrfs]
btrfs_repair_io_failure+0xd4/0x31c [btrfs]
btrfs_read_extent_buffer+0x150/0x16c [btrfs]
read_tree_block+0x38/0xbc [btrfs]
read_tree_root_path+0xfc/0x1bc [btrfs]
btrfs_get_root_ref.part.0+0xd4/0x3a8 [btrfs]
open_ctree+0xa30/0x172c [btrfs]
btrfs_mount_root+0x3c4/0x4a4 [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xec
vfs_kern_mount.part.0+0x90/0xd4
vfs_kern_mount+0x14/0x28
btrfs_mount+0x114/0x418 [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xec
path_mount+0x3e0/0xb64
__arm64_sys_mount+0x200/0x2d8
invoke_syscall+0x48/0x114
el0_svc_common.constprop.0+0x60/0x11c
do_el0_svc+0x38/0x98
el0_svc+0x40/0xa8
el0t_64_sync_handler+0xf4/0x120
el0t_64_sync+0x190/0x194
Code: aa0403e2 b0fff060 91010000 959c2024 (d4210000)
[CAUSE]
In btrfs/027 we test RAID6 with missing devices, in this particular
case, we're repairing a metadata at the end of a data stripe.
But at btrfs_repair_io_failure(), we always pass a full PAGE for repair,
and for subpage case this can cross stripe boundary and lead to the
above BUG_ON().
This metadata repair code is always there, since the introduction of
subpage support, but this can trigger BUG_ON() after the bio split
ability at btrfs_map_bio().
[FIX]
Instead of passing the old PAGE_SIZE, we calculate the correct length
based on the eb size and page size for both regular and subpage cases.
CC: stable@vger.kernel.org # 6.3+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-26 20:30:20 +08:00
		u32 len = end - start;
2022-11-15 10:44:06 +01:00
2023-05-26 20:30:20 +08:00
		ret = btrfs_repair_io_failure(fs_info, 0, start, len,
2023-12-12 15:54:10 +10:30
					      start, folio, offset_in_folio(folio, start),
					      mirror_num);
2022-11-15 10:44:06 +01:00
		if (ret)
			break;
	}
	return ret;
}
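The per-folio range computed in btrfs_repair_eb_io_failure() is the intersection of the extent buffer with the folio that backs it. A small user-space sketch of that arithmetic follows; struct toy_range and toy_repair_range() are hypothetical names, and plain integers stand in for the eb and folio fields.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: the byte range to hand to the repair call. */
struct toy_range {
	uint64_t start;
	uint32_t len;
};

/* Clamp the extent buffer [eb_start, eb_start + eb_len) to one folio. */
static struct toy_range toy_repair_range(uint64_t eb_start, uint32_t eb_len,
					 uint64_t folio_pos, uint32_t folio_size)
{
	uint64_t eb_end = eb_start + eb_len;
	uint64_t folio_end = folio_pos + folio_size;
	uint64_t start = eb_start > folio_pos ? eb_start : folio_pos;
	uint64_t end = eb_end < folio_end ? eb_end : folio_end;

	return (struct toy_range){ .start = start, .len = (uint32_t)(end - start) };
}
```

With a 16K extent buffer at offset 48K inside a single 64K folio (the subpage case), the range is the 16K of the buffer itself rather than the whole folio. With 4K folios, each call covers exactly one 4K slice of the buffer.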
2008-09-29 15:18:18 -04:00
/*
 * helper to read a given tree block, doing retries as required when
 * the checksums don't match and we have alternate mirrors to try.
2018-03-29 09:08:11 +08:00
 *
2022-09-14 13:32:50 +08:00
 * @check:		expected tree parentness check, see the comments of the
 *			structure for details.
2008-09-29 15:18:18 -04:00
 */
2022-03-11 11:35:34 +00:00
int btrfs_read_extent_buffer(struct extent_buffer *eb,
2024-05-30 19:14:12 +02:00
			     const struct btrfs_tree_parent_check *check)
2008-04-09 16:28:12 -04:00
{
2019-03-20 14:56:39 +01:00
	struct btrfs_fs_info *fs_info = eb->fs_info;
2012-03-26 21:57:36 -04:00
	int failed = 0;
2008-04-09 16:28:12 -04:00
	int ret;
	int num_copies = 0;
	int mirror_num = 0;
2012-03-26 21:57:36 -04:00
	int failed_mirror = 0;
2008-04-09 16:28:12 -04:00
2022-09-14 13:32:50 +08:00
	ASSERT(check);
2008-04-09 16:28:12 -04:00
	while (1) {
2018-11-06 16:40:20 +02:00
		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
btrfs: move tree block parentness check into validate_extent_buffer()
[BACKGROUND]
Although both btrfs metadata and data has their read time verification
done at endio time (btrfs_validate_metadata_buffer() and
btrfs_verify_data_csum()), metadata has extra verification, mostly
parentness check including first key/transid/owner_root/level, done at
read_tree_block() and btrfs_read_extent_buffer().
On the other hand, all the data verification is done at endio context.
[ENHANCEMENT]
This patch will make a new union in btrfs_bio, taking the space of the
old data checksums, thus it will not increase the memory usage.
With that extra btrfs_tree_parent_check inside btrfs_bio, we can just
pass the check parameter into read_extent_buffer_pages(), and before
submitting the bio, we can copy the check structure into btrfs_bio.
And finally at endio time, we can grab btrfs_bio::parent_check and pass
it to validate_extent_buffer(), to move the remaining checks into it.
This brings the following benefits:
- Much simpler btrfs_read_extent_buffer()
Now it only needs to iterate through all mirrors.
- Simpler read-time transid check
Previously we go verify_parent_transid() after reading out the extent
buffer.
Now the transid check is done inside the endio function, no other
code can modify the content.
Thus no need to use the extent lock anymore.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-14 13:32:51 +08:00
		ret = read_extent_buffer_pages(eb, WAIT_COMPLETE, mirror_num, check);
		if (!ret)
			break;
2009-01-05 21:25:51 -05:00
2016-06-22 18:54:23 -04:00
		num_copies = btrfs_num_copies(fs_info,
2008-04-09 16:28:12 -04:00
					      eb->start, eb->len);
2008-04-28 16:40:52 -04:00
		if (num_copies == 1)
2012-03-26 21:57:36 -04:00
			break;
2008-04-28 16:40:52 -04:00
2012-04-16 09:42:26 -04:00
		if (!failed_mirror) {
			failed = 1;
			failed_mirror = eb->read_mirror;
		}
2008-04-09 16:28:12 -04:00
		mirror_num++;
2012-03-26 21:57:36 -04:00
		if (mirror_num == failed_mirror)
			mirror_num++;
2008-04-28 16:40:52 -04:00
		if (mirror_num > num_copies)
2012-03-26 21:57:36 -04:00
			break;
2008-04-09 16:28:12 -04:00
	}
2012-03-26 21:57:36 -04:00
2012-07-10 07:30:17 -06:00
	if (failed && !ret && failed_mirror)
2019-03-20 11:23:44 +01:00
		btrfs_repair_eb_io_failure(eb, failed_mirror);
2012-03-26 21:57:36 -04:00
	return ret;
2008-04-09 16:28:12 -04:00
}
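The mirror iteration in btrfs_read_extent_buffer() can be modelled without any I/O. The sketch below assumes a hypothetical struct toy_dev whose bad[] array marks corrupt mirrors, and it assumes that a read with mirror number 0 ("any mirror") lands on mirror 1; toy_read() and toy_read_extent_buffer() are illustrative names. The loop itself follows the source: remember the first mirror that failed, skip it on later attempts, stop once the copies run out, and report the failed mirror for repair only if a later read succeeded.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device model: mirrors are numbered from 1. */
struct toy_dev {
	int num_copies;
	bool bad[8];		/* bad[m] is true when mirror m is corrupt */
	int read_mirror;	/* mirror used by the most recent read */
};

/* Assumption for this sketch: mirror_num 0 resolves to mirror 1. */
static int toy_read(struct toy_dev *dev, int mirror_num)
{
	dev->read_mirror = mirror_num ? mirror_num : 1;
	return dev->bad[dev->read_mirror] ? -EIO : 0;
}

/*
 * Try mirrors until one read succeeds. On success after an earlier failure,
 * *repaired_mirror is set to the mirror that should be rewritten, else to 0.
 */
static int toy_read_extent_buffer(struct toy_dev *dev, int *repaired_mirror)
{
	int failed_mirror = 0;
	int mirror_num = 0;
	int ret;

	*repaired_mirror = 0;
	while (1) {
		ret = toy_read(dev, mirror_num);
		if (!ret)
			break;
		if (dev->num_copies == 1)
			break;
		if (!failed_mirror)
			failed_mirror = dev->read_mirror;
		mirror_num++;
		if (mirror_num == failed_mirror)
			mirror_num++;
		if (mirror_num > dev->num_copies)
			break;
	}
	if (!ret && failed_mirror)
		*repaired_mirror = failed_mirror;
	return ret;
}
```

With two copies and a corrupt mirror 1, the first read fails, mirror 2 is read instead, and mirror 1 is reported for repair. If every mirror is corrupt, the last error is returned and nothing is reported.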
2007-10-15 16:19:22 -04:00
2023-05-03 17:24:35 +02:00
/*
 * Checksum a dirty tree block before IO.
 */
blk_status_t btree_csum_one_bio(struct btrfs_bio *bbio)
2021-03-25 15:14:40 +08:00
{
2023-05-03 17:24:35 +02:00
	struct extent_buffer *eb = bbio->private;
2021-03-25 15:14:40 +08:00
	struct btrfs_fs_info *fs_info = eb->fs_info;
2023-05-03 17:24:35 +02:00
	u64 found_start = btrfs_header_bytenr(eb);
2023-10-04 11:38:51 +01:00
	u64 last_trans;
2021-03-25 15:14:40 +08:00
	u8 result[BTRFS_CSUM_SIZE];
	int ret;
2023-05-03 17:24:35 +02:00
	/* Btree blocks are always contiguous on disk. */
	if (WARN_ON_ONCE(bbio->file_offset != eb->start))
		return BLK_STS_IOERR;
	if (WARN_ON_ONCE(bbio->bio.bi_iter.bi_size != eb->len))
		return BLK_STS_IOERR;
2023-11-23 07:47:16 -08:00
	/*
	 * If an extent_buffer is marked as EXTENT_BUFFER_ZONED_ZEROOUT, don't
	 * checksum it but zero-out its content. This is done to preserve
	 * ordering of I/O without unnecessarily writing out data.
	 */
2023-11-23 07:47:15 -08:00
	if (test_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags)) {
2023-11-23 07:47:16 -08:00
		memzero_extent_buffer(eb, 0, eb->len);
2023-05-03 17:24:35 +02:00
		return BLK_STS_OK;
	}
	if (WARN_ON_ONCE(found_start != eb->start))
		return BLK_STS_IOERR;
2023-12-12 12:58:37 +10:30
	if (WARN_ON(!btrfs_folio_test_uptodate(fs_info, eb->folios[0],
					       eb->start, eb->len)))
2023-05-03 17:24:35 +02:00
		return BLK_STS_IOERR;
2021-03-25 15:14:40 +08:00
	ASSERT(memcmp_extent_buffer(eb, fs_info->fs_devices->metadata_uuid,
				    offsetof(struct btrfs_header, fsid),
				    BTRFS_FSID_SIZE) == 0);
	csum_tree_block(eb, result);
	if (btrfs_header_level(eb))
		ret = btrfs_check_node(eb);
	else
2023-04-29 16:07:12 -04:00
		ret = btrfs_check_leaf(eb);
2021-03-25 15:14:40 +08:00
btrfs: verify the tranisd of the to-be-written dirty extent buffer
[BUG]
There is a bug report that a bitflip in the transid part of an extent
buffer makes btrfs to reject certain tree blocks:
BTRFS error (device dm-0): parent transid verify failed on 1382301696 wanted 262166 found 22
[CAUSE]
Note the failed transid check, hex(262166) = 0x40016, while
hex(22) = 0x16.
It's an obvious bitflip.
Furthermore, the reporter also confirmed the bitflip is from the
hardware, so it's a real hardware caused bitflip, and such problem can
not be detected by the existing tree-checker framework.
As tree-checker can only verify the content inside one tree block, while
generation of a tree block can only be verified against its parent.
So such problem remain undetected.
[FIX]
Although tree-checker can not verify it at write-time, we still have a
quick (but not the most accurate) way to catch such obvious corruption.
Function csum_one_extent_buffer() is called before we submit metadata
write.
Thus it means, all the extent buffer passed in should be dirty tree
blocks, and should be newer than last committed transaction.
Using that we can catch the above bitflip.
Although it's not a perfect solution, as if the corrupted generation is
higher than the correct value, we have no way to catch it at all.
Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
Link: https://lore.kernel.org/linux-btrfs/2dfcbc130c55cc6fd067b93752e90bd2b079baca.camel@scientia.org/
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-02 09:10:21 +08:00
if ( ret < 0 )
goto error ;
/*
* Also check the generation : an eb that reaches here must be newer than
* the last committed transaction , otherwise something went seriously wrong .
*/
2023-10-04 11:38:51 +01:00
last_trans = btrfs_get_last_trans_committed ( fs_info ) ;
if ( unlikely ( btrfs_header_generation ( eb ) < = last_trans ) ) {
2022-03-02 09:10:21 +08:00
ret = - EUCLEAN ;
2021-03-25 15:14:40 +08:00
btrfs_err ( fs_info ,
2022-03-02 09:10:21 +08:00
" block=%llu bad generation, have %llu expect > %llu " ,
2023-10-04 11:38:51 +01:00
eb - > start , btrfs_header_generation ( eb ) , last_trans ) ;
2022-03-02 09:10:21 +08:00
goto error ;
2021-03-25 15:14:40 +08:00
}
write_extent_buffer ( eb , result , 0 , fs_info - > csum_size ) ;
2023-05-03 17:24:35 +02:00
return BLK_STS_OK ;
2022-03-02 09:10:21 +08:00
error :
btrfs_print_tree ( eb , 0 ) ;
btrfs_err ( fs_info , " block=%llu write time tree block corruption detected " ,
eb - > start ) ;
2023-01-10 14:56:37 +00:00
/*
* Be noisy if this is an extent buffer from a log tree . We don ' t abort
* a transaction in case there ' s a bad log tree extent buffer , we just
* fall back to a transaction commit . Still we want to know when there is
* a bad log tree extent buffer , as that may signal a bug somewhere .
*/
WARN_ON ( IS_ENABLED ( CONFIG_BTRFS_DEBUG ) | |
btrfs_header_owner ( eb ) = = BTRFS_TREE_LOG_OBJECTID ) ;
2023-01-21 07:50:15 +01:00
return errno_to_blk_status ( ret ) ;
}
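The write-time generation check in btree_csum_one_bio() can be illustrated outside the kernel. What follows is a minimal user-space sketch, not the kernel code: `flipped_bits()` and `gen_is_stale()` are hypothetical helper names introduced for illustration only, and the sample values (262166 and 22) are the ones quoted in the bug report in the commit message.

```c
#include <assert.h>	/* only needed for the checks in the usage note */
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: count the bits that differ between two transids. */
static int flipped_bits(uint64_t wanted, uint64_t found)
{
	uint64_t diff = wanted ^ found;
	int n = 0;

	while (diff) {
		diff &= diff - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Hypothetical helper mirroring the write-time rule: a dirty tree block
 * must be newer than the last committed transaction, so a generation
 * that is less than or equal to it is treated as corruption.
 */
static bool gen_is_stale(uint64_t eb_generation, uint64_t last_committed)
{
	return eb_generation <= last_committed;
}
```

With the reported values, `flipped_bits(262166, 22)` is 1 (0x40016 versus 0x16), and `gen_is_stale(22, 262165)` is true, so the corrupted block would be rejected before the write. As the commit message notes, a flip that raises the generation passes this test and stays undetected.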
2023-05-24 20:02:39 +08:00
static bool check_tree_block_fsid ( struct extent_buffer * eb )
2008-11-17 21:11:30 -05:00
{
2019-03-20 13:12:00 +01:00
struct btrfs_fs_info * fs_info = eb - > fs_info ;
2020-07-16 10:25:33 +03:00
struct btrfs_fs_devices * fs_devices = fs_info - > fs_devices , * seed_devs ;
2017-07-29 17:50:09 +08:00
u8 fsid [ BTRFS_FSID_SIZE ] ;
2008-11-17 21:11:30 -05:00
2019-03-20 13:15:57 +01:00
read_extent_buffer ( eb , fsid , offsetof ( struct btrfs_header , fsid ) ,
BTRFS_FSID_SIZE ) ;
2023-07-31 19:16:36 +08:00
2020-07-16 10:25:33 +03:00
/*
2023-08-23 22:52:13 +08:00
* alloc_fsid_devices ( ) copies the fsid into fs_devices : : metadata_uuid .
* This is then overwritten by metadata_uuid in device_list_add ( ) if it is
* present . The same is true for a seed device as well . So use of
* fs_devices : : metadata_uuid is appropriate here .
2020-07-16 10:25:33 +03:00
*/
2023-07-31 19:16:36 +08:00
if ( memcmp ( fsid , fs_info - > fs_devices - > metadata_uuid , BTRFS_FSID_SIZE ) = = 0 )
2023-05-24 20:02:39 +08:00
return false ;
2020-07-16 10:25:33 +03:00
list_for_each_entry ( seed_devs , & fs_devices - > seed_list , seed_list )
if ( ! memcmp ( fsid , seed_devs - > fsid , BTRFS_FSID_SIZE ) )
2023-05-24 20:02:39 +08:00
return false ;
2020-07-16 10:25:33 +03:00
2023-05-24 20:02:39 +08:00
return true ;
2008-11-17 21:11:30 -05:00
}
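The lookup order in check_tree_block_fsid() (first the filesystem's own metadata_uuid, then every seed device) can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel implementation: `fsid_mismatch()` and `FSID_SIZE` are hypothetical names, and the seed list is modelled as a flat byte array rather than the kernel's linked list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FSID_SIZE 16	/* assumed to match BTRFS_FSID_SIZE */

/*
 * Hypothetical helper. Like the kernel function, it returns true when
 * the fsid is NOT recognized. seed_fsids holds nr_seeds fsids packed
 * back to back, FSID_SIZE bytes each.
 */
static bool fsid_mismatch(const uint8_t *fsid, const uint8_t *metadata_uuid,
			  const uint8_t *seed_fsids, size_t nr_seeds)
{
	size_t i;

	assert(fsid && metadata_uuid);

	/* The common case: the block belongs to this filesystem. */
	if (memcmp(fsid, metadata_uuid, FSID_SIZE) == 0)
		return false;

	/* Otherwise it may belong to one of the seed devices. */
	for (i = 0; i < nr_seeds; i++)
		if (memcmp(fsid, seed_fsids + i * FSID_SIZE, FSID_SIZE) == 0)
			return false;

	return true;
}
```

Returning true for a mismatch keeps the call site short: the caller can treat any true result as a bad fsid and fail the read.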
2020-11-03 21:30:48 +08:00
/* Do basic extent buffer checks at read time */
2023-05-03 17:24:28 +02:00
int btrfs_validate_extent_buffer ( struct extent_buffer * eb ,
2024-05-30 19:14:12 +02:00
const struct btrfs_tree_parent_check * check )
2008-04-09 16:28:12 -04:00
{
2020-11-03 21:30:48 +08:00
struct btrfs_fs_info * fs_info = eb - > fs_info ;
2008-04-09 16:28:12 -04:00
u64 found_start ;
2020-11-03 21:30:48 +08:00
const u32 csum_size = fs_info - > csum_size ;
u8 found_level ;
2019-02-25 14:24:15 +01:00
u8 result [ BTRFS_CSUM_SIZE ] ;
2020-09-21 22:07:14 +02:00
const u8 * header_csum ;
2020-11-03 21:30:48 +08:00
int ret = 0 ;
2024-06-14 13:52:30 +09:30
const bool ignore_csum = btrfs_test_opt ( fs_info , IGNOREMETACSUMS ) ;
2012-03-26 21:57:36 -04:00
btrfs: move tree block parentness check into validate_extent_buffer()
[BACKGROUND]
Although both btrfs metadata and data has their read time verification
done at endio time (btrfs_validate_metadata_buffer() and
btrfs_verify_data_csum()), metadata has extra verification, mostly
parentness check including first key/transid/owner_root/level, done at
read_tree_block() and btrfs_read_extent_buffer().
On the other hand, all the data verification is done at endio context.
[ENHANCEMENT]
This patch will make a new union in btrfs_bio, taking the space of the
old data checksums, thus it will not increase the memory usage.
With that extra btrfs_tree_parent_check inside btrfs_bio, we can just
pass the check parameter into read_extent_buffer_pages(), and before
submitting the bio, we can copy the check structure into btrfs_bio.
And finally at endio time, we can grab btrfs_bio::parent_check and pass
it to validate_extent_buffer(), to move the remaining checks into it.
This brings the following benefits:
- Much simpler btrfs_read_extent_buffer()
Now it only needs to iterate through all mirrors.
- Simpler read-time transid check
Previously we go verify_parent_transid() after reading out the extent
buffer.
Now the transid check is done inside the endio function, no other
code can modify the content.
Thus no need to use the extent lock anymore.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-14 13:32:51 +08:00
ASSERT ( check ) ;
2008-04-09 16:28:12 -04:00
found_start = btrfs_header_bytenr ( eb ) ;
2010-08-06 13:21:20 -04:00
if ( found_start ! = eb - > start ) {
2022-06-19 21:47:56 +08:00
btrfs_err_rl ( fs_info ,
" bad tree block start, mirror %u want %llu have %llu " ,
eb - > read_mirror , eb - > start , found_start ) ;
2008-04-09 16:28:12 -04:00
ret = - EIO ;
2020-11-03 21:30:48 +08:00
goto out ;
2008-04-09 16:28:12 -04:00
}
2019-03-20 13:12:00 +01:00
if ( check_tree_block_fsid ( eb ) ) {
2022-06-19 21:47:56 +08:00
btrfs_err_rl ( fs_info , " bad fsid on logical %llu mirror %u " ,
eb - > start , eb - > read_mirror ) ;
2008-05-12 13:39:03 -04:00
ret = - EIO ;
2020-11-03 21:30:48 +08:00
goto out ;
2008-05-12 13:39:03 -04:00
}
2008-04-09 16:28:12 -04:00
found_level = btrfs_header_level ( eb ) ;
2013-04-23 11:30:14 -04:00
if ( found_level > = BTRFS_MAX_LEVEL ) {
2022-06-19 21:47:56 +08:00
btrfs_err ( fs_info ,
" bad tree block level, mirror %u level %d on logical %llu " ,
eb - > read_mirror , btrfs_header_level ( eb ) , eb - > start ) ;
2013-04-23 11:30:14 -04:00
ret = - EIO ;
2020-11-03 21:30:48 +08:00
goto out ;
2013-04-23 11:30:14 -04:00
}
2008-04-09 16:28:12 -04:00
2020-02-27 21:00:49 +01:00
csum_tree_block ( eb , result ) ;
2023-12-07 09:39:27 +10:30
header_csum = folio_address ( eb - > folios [ 0 ] ) +
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, where we won't use larger folios anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 12:58:36 +10:30
get_eb_offset_in_folio ( eb , offsetof ( struct btrfs_header , csum ) ) ;
2011-03-16 13:42:43 -04:00
2020-09-21 22:07:14 +02:00
if ( memcmp ( result , header_csum , csum_size ) ! = 0 ) {
2019-02-25 14:24:15 +01:00
btrfs_warn_rl ( fs_info ,
2024-06-14 13:52:30 +09:30
" checksum verify failed on logical %llu mirror %u wanted " CSUM_FMT " found " CSUM_FMT " level %d%s " ,
2022-06-19 21:47:56 +08:00
eb - > start , eb - > read_mirror ,
2020-09-21 22:07:14 +02:00
CSUM_FMT_VALUE ( csum_size , header_csum ) ,
2020-09-21 16:57:14 +09:00
CSUM_FMT_VALUE ( csum_size , result ) ,
2024-06-14 13:52:30 +09:30
btrfs_header_level ( eb ) ,
ignore_csum ? " , ignored " : " " ) ;
if ( ! ignore_csum ) {
ret = - EUCLEAN ;
goto out ;
}
2019-02-25 14:24:15 +01:00
}
2022-09-14 13:32:51 +08:00
if ( found_level ! = check - > level ) {
2022-12-29 07:32:23 +08:00
btrfs_err ( fs_info ,
" level verify failed on logical %llu mirror %u wanted %u found %u " ,
eb - > start , eb - > read_mirror , check - > level , found_level ) ;
2022-09-14 13:32:51 +08:00
ret = - EIO ;
goto out ;
}
if ( unlikely ( check - > transid & &
btrfs_header_generation ( eb ) ! = check - > transid ) ) {
btrfs_err_rl ( eb - > fs_info ,
" parent transid verify failed on logical %llu mirror %u wanted %llu found %llu " ,
eb - > start , eb - > read_mirror , check - > transid ,
btrfs_header_generation ( eb ) ) ;
ret = - EIO ;
goto out ;
}
if ( check - > has_first_key ) {
2024-05-30 19:14:12 +02:00
const struct btrfs_key * expect_key = & check - > first_key ;
2022-09-14 13:32:51 +08:00
struct btrfs_key found_key ;
if ( found_level )
btrfs_node_key_to_cpu ( eb , & found_key , 0 ) ;
else
btrfs_item_key_to_cpu ( eb , & found_key , 0 ) ;
if ( unlikely ( btrfs_comp_cpu_keys ( expect_key , & found_key ) ) ) {
btrfs_err ( fs_info ,
" tree first key mismatch detected, bytenr=%llu parent_transid=%llu key expected=(%llu,%u,%llu) has=(%llu,%u,%llu) " ,
eb - > start , check - > transid ,
expect_key - > objectid ,
expect_key - > type , expect_key - > offset ,
found_key . objectid , found_key . type ,
found_key . offset ) ;
ret = - EUCLEAN ;
goto out ;
}
}
if ( check - > owner_root ) {
ret = btrfs_check_eb_owner ( eb , check - > owner_root ) ;
if ( ret < 0 )
goto out ;
}
2011-03-16 13:42:43 -04:00
/*
* If this is a leaf block and it is corrupt , set the corrupt bit so
* that we don ' t try and read the other copies of this block , just
* return - EIO .
*/
2023-04-29 16:07:12 -04:00
if ( found_level = = 0 & & btrfs_check_leaf ( eb ) ) {
2011-03-16 13:42:43 -04:00
set_bit ( EXTENT_BUFFER_CORRUPT , & eb - > bflags ) ;
ret = - EIO ;
}
2008-04-09 16:28:12 -04:00
2019-03-20 16:25:00 +01:00
if ( found_level > 0 & & btrfs_check_node ( eb ) )
2016-08-23 17:37:45 -07:00
ret = - EIO ;
2023-05-03 17:24:23 +02:00
if ( ret )
2019-03-20 14:27:40 +08:00
btrfs_err ( fs_info ,
2022-06-19 21:47:56 +08:00
" read time tree block corruption detected on logical %llu mirror %u " ,
eb - > start , eb - > read_mirror ) ;
2020-11-03 21:30:48 +08:00
out :
return ret ;
}
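The parent checks in btrfs_validate_extent_buffer() (level, then transid, then first key) can be sketched in user space as below. This is a simplified illustration, not the kernel code: `struct key`, `struct parent_check`, `enum check_result`, `comp_keys()` and `check_parent()` are hypothetical stand-ins, and the kernel returns -EIO or -EUCLEAN where this sketch returns an enum value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for btrfs_key and btrfs_tree_parent_check. */
struct key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

struct parent_check {
	uint64_t transid;	/* 0 means "do not check the generation" */
	uint8_t level;
	bool has_first_key;
	struct key first_key;
};

enum check_result {
	CHECK_OK,
	CHECK_BAD_LEVEL,
	CHECK_BAD_TRANSID,
	CHECK_BAD_FIRST_KEY,
};

/* Order keys by objectid, then type, then offset. */
static int comp_keys(const struct key *a, const struct key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* Apply the checks in the same order as the read-time validation. */
static enum check_result check_parent(const struct parent_check *check,
				      uint8_t found_level,
				      uint64_t found_generation,
				      const struct key *found_first_key)
{
	assert(check && found_first_key);

	if (found_level != check->level)
		return CHECK_BAD_LEVEL;
	if (check->transid && found_generation != check->transid)
		return CHECK_BAD_TRANSID;
	if (check->has_first_key &&
	    comp_keys(&check->first_key, found_first_key) != 0)
		return CHECK_BAD_FIRST_KEY;
	return CHECK_OK;
}
```

Both the transid and the first-key checks are optional: a zero transid or a cleared has_first_key skips the corresponding comparison, which matches the conditions guarding them in the function above.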
2010-12-07 14:54:09 +00:00
# ifdef CONFIG_MIGRATION
2022-06-06 09:22:19 -04:00
static int btree_migrate_folio ( struct address_space * mapping ,
struct folio * dst , struct folio * src , enum migrate_mode mode )
2010-11-21 22:20:49 -05:00
{
/*
* we can ' t safely write a btree page from here ,
* we haven ' t done the locking hook
*/
2022-06-06 09:22:19 -04:00
if ( folio_test_dirty ( src ) )
2010-11-21 22:20:49 -05:00
return - EAGAIN ;
/*
* Buffers may be managed in a filesystem specific way .
* We must have no buffers or drop them .
*/
2022-06-06 09:22:19 -04:00
if ( folio_get_private ( src ) & &
! filemap_release_folio ( src , GFP_KERNEL ) )
2010-11-21 22:20:49 -05:00
return - EAGAIN ;
2022-06-06 10:27:41 -04:00
return migrate_folio ( mapping , dst , src , mode ) ;
2010-11-21 22:20:49 -05:00
}
2022-06-06 09:22:19 -04:00
# else
# define btree_migrate_folio NULL
2010-12-07 14:54:09 +00:00
# endif
2010-11-21 22:20:49 -05:00
2007-11-07 21:08:01 -05:00
static int btree_writepages ( struct address_space * mapping ,
struct writeback_control * wbc )
{
2013-01-29 10:09:20 +00:00
int ret ;
2007-12-11 12:42:00 -05:00
if ( wbc - > sync_mode = = WB_SYNC_NONE ) {
2023-09-14 16:45:41 +02:00
struct btrfs_fs_info * fs_info ;
2007-11-27 07:52:01 -08:00
if ( wbc - > for_kupdate )
return 0 ;
2023-09-14 16:45:41 +02:00
fs_info = inode_to_fs_info ( mapping - > host ) ;
2009-03-13 11:00:37 -04:00
/* this is a bit racy, but that's ok */
2018-07-02 15:44:58 +08:00
ret = __percpu_counter_compare ( & fs_info - > dirty_metadata_bytes ,
BTRFS_DIRTY_METADATA_THRESH ,
fs_info - > dirty_metadata_batch ) ;
2013-01-29 10:09:20 +00:00
if ( ret < 0 )
2007-11-26 16:34:41 -08:00
return 0 ;
}
2012-03-13 09:38:00 -04:00
return btree_write_cache_pages ( mapping , wbc ) ;
2007-11-07 21:08:01 -05:00
}
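The decision made by btree_writepages() can be reduced to a small predicate. The sketch below is a user-space illustration only: `should_write_btree()` is a hypothetical name, the threshold is passed in as a parameter instead of using BTRFS_DIRTY_METADATA_THRESH, and an exact comparison stands in for the kernel's batched (and, as the comment says, slightly racy) per-CPU counter comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: should btree writeback proceed?
 *
 * background   corresponds to wbc->sync_mode == WB_SYNC_NONE
 * for_kupdate  corresponds to wbc->for_kupdate
 * dirty_bytes  amount of dirty metadata currently accounted
 * thresh       dirty metadata threshold
 */
static bool should_write_btree(bool background, bool for_kupdate,
			       int64_t dirty_bytes, int64_t thresh)
{
	assert(thresh >= 0);

	if (background) {
		/* Periodic kupdate writeback never writes the btree. */
		if (for_kupdate)
			return false;
		/* Too little dirty metadata: not worth writing yet. */
		if (dirty_bytes < thresh)
			return false;
	}
	/* Data-integrity writeback always goes ahead. */
	return true;
}
```

Only background writeback is throttled this way. A sync request bypasses both tests and always reaches btree_write_cache_pages().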
2022-04-30 23:15:16 -04:00
static bool btree_release_folio ( struct folio * folio , gfp_t gfp_flags )
2007-10-15 16:14:19 -04:00
{
2022-04-30 23:15:16 -04:00
if ( folio_test_writeback ( folio ) | | folio_test_dirty ( folio ) )
return false ;
2012-01-26 15:01:12 -05:00
2022-04-30 23:15:16 -04:00
return try_release_extent_buffer ( & folio - > page ) ;
2007-03-28 13:57:48 -04:00
}
2022-02-09 20:21:39 +00:00
static void btree_invalidate_folio ( struct folio * folio , size_t offset ,
size_t length )
2007-03-28 13:57:48 -04:00
{
2008-01-24 16:13:08 -05:00
struct extent_io_tree * tree ;
2023-09-13 16:11:29 +02:00
tree = & folio_to_inode ( folio ) - > io_tree ;
2022-02-09 20:21:39 +00:00
extent_invalidate_folio ( tree , folio , offset ) ;
2022-04-30 23:15:16 -04:00
btree_release_folio ( folio , GFP_NOFS ) ;
2022-02-09 20:21:39 +00:00
if ( folio_get_private ( folio ) ) {
2023-09-14 16:24:43 +02:00
btrfs_warn ( folio_to_fs_info ( folio ) ,
2022-02-09 20:21:39 +00:00
" folio private not zero on folio %llu " ,
( unsigned long long ) folio_pos ( folio ) ) ;
folio_detach_private ( folio ) ;
2008-04-18 16:11:30 -04:00
}
2007-03-28 13:57:48 -04:00
}
2012-10-15 13:30:43 -04:00
# ifdef DEBUG
2022-02-09 20:22:02 +00:00
static bool btree_dirty_folio ( struct address_space * mapping ,
struct folio * folio )
{
2023-09-14 16:45:41 +02:00
struct btrfs_fs_info * fs_info = inode_to_fs_info ( mapping - > host ) ;
btrfs: fix a compilation error if DEBUG is defined in btree_dirty_folio
[BUG]
After commit 72a69cd03082 ("btrfs: subpage: pack all subpage bitmaps
into a larger bitmap"), the DEBUG section of btree_dirty_folio() would
no longer compile.
[CAUSE]
If DEBUG is defined, we would do extra checks for btree_dirty_folio(),
mostly to make sure the range we marked dirty has an extent buffer and
that extent buffer is dirty.
For subpage, we need to iterate through all the extent buffers covered
by that page range, and make sure they all matches the criteria.
However commit 72a69cd03082 ("btrfs: subpage: pack all subpage bitmaps
into a larger bitmap") changes how we store the bitmap, we pack all the
16 bits bitmaps into a larger bitmap, which would save some space.
This means we no longer have btrfs_subpage::dirty_bitmap, instead the
dirty bitmap is starting at btrfs_subpage_info::dirty_offset, and has a
length of btrfs_subpage_info::bitmap_nr_bits.
[FIX]
Although I'm not sure if it still makes sense to maintain such code, at
least let it compile.
This patch would let us test the bits one by one through the bitmaps.
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-22 13:50:51 +08:00
struct btrfs_subpage_info * spi = fs_info - > subpage_info ;
2021-03-25 15:14:39 +08:00
struct btrfs_subpage * subpage ;
2012-03-13 09:38:00 -04:00
struct extent_buffer * eb ;
2021-03-25 15:14:39 +08:00
int cur_bit = 0 ;
2022-02-09 20:22:02 +00:00
u64 page_start = folio_pos ( folio ) ;
2021-03-25 15:14:39 +08:00
if ( fs_info - > sectorsize = = PAGE_SIZE ) {
2022-02-09 20:22:02 +00:00
eb = folio_get_private ( folio ) ;
2021-03-25 15:14:39 +08:00
BUG_ON ( ! eb ) ;
BUG_ON ( ! test_bit ( EXTENT_BUFFER_DIRTY , & eb - > bflags ) ) ;
BUG_ON ( ! atomic_read ( & eb - > refs ) ) ;
2021-09-22 10:36:45 +01:00
btrfs_assert_tree_write_locked ( eb ) ;
2022-02-09 20:22:02 +00:00
return filemap_dirty_folio ( mapping , folio ) ;
2021-03-25 15:14:39 +08:00
}
2023-08-22 13:50:51 +08:00
ASSERT ( spi ) ;
2022-02-09 20:22:02 +00:00
subpage = folio_get_private ( folio ) ;
2021-03-25 15:14:39 +08:00
2023-08-22 13:50:51 +08:00
for ( cur_bit = spi - > dirty_offset ;
cur_bit < spi - > dirty_offset + spi - > bitmap_nr_bits ;
cur_bit + + ) {
2021-03-25 15:14:39 +08:00
unsigned long flags ;
u64 cur ;
spin_lock_irqsave ( & subpage - > lock , flags ) ;
2023-08-22 13:50:51 +08:00
		if (!test_bit(cur_bit, subpage->bitmaps)) {
2021-03-25 15:14:39 +08:00
			spin_unlock_irqrestore(&subpage->lock, flags);
			continue;
		}
		spin_unlock_irqrestore(&subpage->lock, flags);
		cur = page_start + cur_bit * fs_info->sectorsize;
2012-03-13 09:38:00 -04:00
2021-03-25 15:14:39 +08:00
		eb = find_extent_buffer(fs_info, cur);
		ASSERT(eb);
		ASSERT(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
		ASSERT(atomic_read(&eb->refs));
2021-09-22 10:36:45 +01:00
		btrfs_assert_tree_write_locked(eb);
2021-03-25 15:14:39 +08:00
		free_extent_buffer(eb);
2023-08-22 13:50:51 +08:00
		cur_bit += (fs_info->nodesize >> fs_info->sectorsize_bits) - 1;
2021-03-25 15:14:39 +08:00
	}
2022-02-09 20:22:02 +00:00
	return filemap_dirty_folio(mapping, folio);
2012-03-13 09:38:00 -04:00
}
2022-02-09 20:22:02 +00:00
#else
#define btree_dirty_folio filemap_dirty_folio
#endif
2012-03-13 09:38:00 -04:00
2009-09-21 17:01:10 -07:00
static const struct address_space_operations btree_aops = {
2007-11-07 21:08:01 -05:00
	.writepages = btree_writepages,
2022-04-30 23:15:16 -04:00
	.release_folio = btree_release_folio,
2022-02-09 20:21:39 +00:00
	.invalidate_folio = btree_invalidate_folio,
2022-06-06 09:22:19 -04:00
	.migrate_folio = btree_migrate_folio,
	.dirty_folio = btree_dirty_folio,
2007-03-28 13:57:48 -04:00
};
2016-06-22 18:54:24 -04:00
struct extent_buffer *btrfs_find_create_tree_block(
		struct btrfs_fs_info *fs_info,
2020-11-05 10:45:20 -05:00
		u64 bytenr, u64 owner_root,
		int level)
2008-04-01 13:48:14 -04:00
{
2016-06-22 18:54:23 -04:00
	if (btrfs_is_testing(fs_info))
		return alloc_test_extent_buffer(fs_info, bytenr);
2020-11-05 10:45:20 -05:00
	return alloc_extent_buffer(fs_info, bytenr, owner_root, level);
2008-04-01 13:48:14 -04:00
}
2018-03-29 09:08:11 +08:00
/*
 * Read tree block at logical address @bytenr and do various basic but critical
 * verifications.
 *
2022-09-14 13:32:50 +08:00
 * @check:		expected tree parentness check, see comments of the
 *			structure for details.
2018-03-29 09:08:11 +08:00
 */
2016-06-22 18:54:24 -04:00
struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr,
2022-09-14 13:32:50 +08:00
				      struct btrfs_tree_parent_check *check)
2008-04-01 13:48:14 -04:00
{
	struct extent_buffer *buf = NULL;
	int ret;
2022-09-14 13:32:50 +08:00
	ASSERT(check);
	buf = btrfs_find_create_tree_block(fs_info, bytenr, check->owner_root,
					   check->level);
2016-06-06 12:01:23 -07:00
	if (IS_ERR(buf))
		return buf;
2008-04-01 13:48:14 -04:00
2022-09-14 13:32:50 +08:00
	ret = btrfs_read_extent_buffer(buf, check);
2013-07-31 00:39:56 +01:00
	if (ret) {
2019-03-14 09:52:35 +02:00
		free_extent_buffer_stale(buf);
2015-05-25 17:30:15 +08:00
		return ERR_PTR(ret);
2013-07-31 00:39:56 +01:00
	}
2007-10-15 16:14:19 -04:00
	return buf;
2008-04-09 16:28:12 -04:00
2007-02-02 09:18:22 -05:00
}
2016-06-15 09:22:56 -04:00
static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
2012-03-01 14:56:26 +01:00
			 u64 objectid)
2007-02-20 16:40:44 -05:00
{
2024-04-18 00:47:13 +02:00
	bool dummy = btrfs_is_testing(fs_info);
2021-11-05 16:45:44 -04:00
	memset(&root->root_key, 0, sizeof(root->root_key));
	memset(&root->root_item, 0, sizeof(root->root_item));
	memset(&root->defrag_progress, 0, sizeof(root->defrag_progress));
2020-01-24 09:32:18 -05:00
	root->fs_info = fs_info;
2021-11-05 16:45:44 -04:00
	root->root_key.objectid = objectid;
2007-02-21 17:04:57 -05:00
	root->node = NULL;
2007-03-06 20:08:01 -05:00
	root->commit_root = NULL;
2014-04-02 19:51:05 +08:00
	root->state = 0;
2021-11-05 16:45:51 -04:00
	RB_CLEAR_NODE(&root->rb_node);
2008-03-24 15:01:56 -04:00
2024-07-01 10:51:28 +01:00
	btrfs_set_root_last_trans(root, 0);
2020-12-07 17:32:35 +02:00
	root->free_objectid = 0;
2013-05-15 07:48:22 +00:00
	root->nr_delalloc_inodes = 0;
2013-05-15 07:48:23 +00:00
	root->nr_ordered_extents = 0;
btrfs: use an xarray to track open inodes in a root
Currently we use a red black tree (rb-tree) to track the currently open
inodes of a root (in struct btrfs_root::inode_tree). This however is not
very efficient when the number of inodes is large since rb-trees are
binary trees. For example for 100K open inodes, the tree has a depth of
17. Besides that, inserting into the tree requires navigating through it
and pulling useless cache lines in the process since the red black tree
nodes are embedded within the btrfs inode - on the other hand, by being
embedded, it requires no extra memory allocations.
We can improve this by using an xarray instead, which is efficient when
indices are densely clustered (such as inode numbers), is more cache
friendly and behaves like a resizable array, with a much better search
and insertion complexity than a red black tree. This only has one small
disadvantage which is that insertion will sometimes require allocating
memory for the xarray - which may fail (not that often since it uses a
kmem_cache) - but on the other hand we can reduce the btrfs inode
structure size by 24 bytes (from 1080 down to 1056 bytes) after removing
the embedded red black tree node. After the next patches, this will allow
reducing the size of the structure to 1024 bytes, meaning we will be able
to store 4 inodes per 4K page instead of 3.
This change does a straightforward change to use an xarray, and results
in a transaction abort if we can't allocate memory for the xarray when
creating an inode - but the next patch changes things so that we don't
need to abort.
Running the following fs_mark test showed some improvements:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
MOUNT_OPTIONS="-o ssd"
FILES=100000
THREADS=$(nproc --all)
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $DEV
mount $MOUNT_OPTIONS $DEV $MNT
OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
Before this patch:
FSUse%        Count         Size    Files/sec     App Overhead
    10      1200000            0      92081.6         12505547
    16      2400000            0     138222.6         13067072
    23      3600000            0     148833.1         13290336
    43      4800000            0      97864.7         13931248
    53      6000000            0      85597.3         14384313
After this patch:
FSUse%        Count         Size    Files/sec     App Overhead
    10      1200000            0      93225.1         12571078
    16      2400000            0     146720.3         12805007
    23      3600000            0     160626.4         13073835
    46      4800000            0     116286.2         13802927
    53      6000000            0      90087.9         14754892
The test was run with a release kernel config (Debian's default config).
Also capturing the insertion times into the rb tree and into the xarray,
that is measuring the duration of the old function inode_tree_add() and
the duration of the new btrfs_add_inode_to_root() function, gave the
following results (in nanoseconds):
Before this patch, inode_tree_add() execution times:
Count: 5000000
Range: 0.000 - 5536887.000; Mean: 775.674; Median: 729.000; Stddev: 4820.961
Percentiles: 90th: 1015.000; 95th: 1139.000; 99th: 1397.000
0.000 - 7.816: 40 |
7.816 - 37.858: 209 |
37.858 - 170.278: 6059 |
170.278 - 753.961: 2754890 #####################################################
753.961 - 3326.728: 2232312 ###########################################
3326.728 - 14667.018: 4366 |
14667.018 - 64652.943: 852 |
64652.943 - 284981.761: 550 |
284981.761 - 1256150.914: 221 |
1256150.914 - 5536887.000: 7 |
After this patch, btrfs_add_inode_to_root() execution times:
Count: 5000000
Range: 0.000 - 2900652.000; Mean: 272.148; Median: 241.000; Stddev: 2873.369
Percentiles: 90th: 342.000; 95th: 432.000; 99th: 572.000
0.000 - 7.264: 104 |
7.264 - 33.145: 352 |
33.145 - 140.081: 109606 #
140.081 - 581.930: 4840090 #####################################################
581.930 - 2407.590: 43532 |
2407.590 - 9950.979: 2245 |
9950.979 - 41119.278: 514 |
41119.278 - 169902.616: 155 |
169902.616 - 702018.539: 47 |
702018.539 - 2900652.000: 9 |
Average, percentiles, standard deviation, etc, are all much better.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
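The two figures quoted in this message (a depth of 17 for 100K inodes, and 4 instead of 3 inodes per 4K page) can be checked with a small calculation. The sketch below assumes a perfectly balanced binary tree, which is a lower bound for a red-black tree, and ignores any slab allocator overhead; demo_ceil_log2() and demo_objects_per_page() are hypothetical helper names.

```c
#include <assert.h>

/* Smallest depth d such that 2^d >= n, i.e. ceil(log2(n)) for n >= 1. */
static unsigned int demo_ceil_log2(unsigned long n)
{
	unsigned int depth = 0;
	unsigned long capacity = 1;

	assert(n >= 1);
	while (capacity < n) {
		capacity <<= 1;
		depth++;
	}
	return depth;
}

/* How many objects of obj_size bytes fit entirely in one page. */
static unsigned int demo_objects_per_page(unsigned int page_size,
					  unsigned int obj_size)
{
	assert(obj_size > 0);
	return page_size / obj_size;
}
```

With these helpers, 100000 entries need 17 levels, while object sizes of 1080 and 1056 bytes both give 3 objects per 4096-byte page and 1024 bytes gives 4.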
2024-04-24 16:58:01 +01:00
	xa_init(&root->inodes);
2024-04-17 16:06:13 +01:00
	xa_init(&root->delayed_nodes);
2021-11-05 16:45:44 -04:00
	btrfs_init_root_block_rsv(root);
2008-03-24 15:01:56 -04:00
	INIT_LIST_HEAD(&root->dirty_list);
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
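The reference-count rule for COW described in this message can be modeled with a toy structure. This is a simplified sketch under stated assumptions, not the on-disk logic: struct demo_block and demo_cow_block() are hypothetical, extents are modeled as plain in-memory blocks, and back ref items are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_PTRS 4

/* Toy tree block: a reference count plus pointers to lower-level blocks. */
struct demo_block {
	int refs;
	bool freed;
	int nr_ptrs;
	struct demo_block *ptrs[DEMO_MAX_PTRS];
};

/*
 * COW "old" into "new":
 * - non-shared (refs == 1): free the old block at once; the new block
 *   inherits its references, so the children's counts are untouched;
 * - shared (refs > 1): bump every child by one for the new block and drop
 *   one reference from the old block.
 */
static void demo_cow_block(struct demo_block *old, struct demo_block *new)
{
	int i;

	assert(old->refs >= 1 && !old->freed);
	assert(old->nr_ptrs <= DEMO_MAX_PTRS);

	new->refs = 1;
	new->freed = false;
	new->nr_ptrs = old->nr_ptrs;
	for (i = 0; i < old->nr_ptrs; i++)
		new->ptrs[i] = old->ptrs[i];

	if (old->refs == 1) {
		old->refs = 0;
		old->freed = true;
		return;
	}

	for (i = 0; i < new->nr_ptrs; i++)
		new->ptrs[i]->refs++;
	old->refs--;
}
```

In the non-shared case no child count changes, which is the saving the message attributes to avoiding dead root records.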
2009-06-10 10:45:14 -04:00
	INIT_LIST_HEAD(&root->root_list);
2013-05-15 07:48:22 +00:00
	INIT_LIST_HEAD(&root->delalloc_inodes);
	INIT_LIST_HEAD(&root->delalloc_root);
2013-05-15 07:48:23 +00:00
	INIT_LIST_HEAD(&root->ordered_extents);
	INIT_LIST_HEAD(&root->ordered_root);
2019-01-23 15:15:14 +08:00
	INIT_LIST_HEAD(&root->reloc_dirty_list);
2013-05-15 07:48:22 +00:00
	spin_lock_init(&root->delalloc_lock);
2013-05-15 07:48:23 +00:00
	spin_lock_init(&root->ordered_extent_lock);
2010-05-16 10:46:25 -04:00
	spin_lock_init(&root->accounting_lock);
2017-12-12 15:34:34 +08:00
	spin_lock_init(&root->qgroup_meta_rsv_lock);
2008-06-25 16:01:30 -04:00
	mutex_init(&root->objectid_mutex);
2008-09-05 16:13:11 -04:00
	mutex_init(&root->log_mutex);
2014-03-06 13:55:02 +08:00
	mutex_init(&root->ordered_extent_mutex);
2014-03-06 13:55:03 +08:00
	mutex_init(&root->delalloc_mutex);
btrfs: qgroup: try to flush qgroup space when we get -EDQUOT
[PROBLEM]
There are known problems related to how btrfs handles qgroup reserved
space. One of the most obvious cases is the test case btrfs/153,
which does fallocate, then writes into the preallocated range.
btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
--- tests/btrfs/153.out 2019-10-22 15:18:14.068965341 +0800
+++ xfstests-dev/results//btrfs/153.out.bad 2020-07-01 20:24:40.730000089 +0800
@@ -1,2 +1,5 @@
QA output created by 153
+pwrite: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
Silence is golden
...
(Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad' to see the entire diff)
[CAUSE]
Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
we always reserve space no matter if it's COW or not.
Such behavior change is mostly for performance, and reverting it is not
a good idea anyway.
For a preallocated extent, we reserve qgroup data space for it already,
and since we also reserve data space for qgroup at buffered write time,
it needs twice the space for us to write into preallocated space.
This leads to the -EDQUOT in the buffered write routine.
And we can't follow the same solution: unlike the data/meta space check,
qgroup reserved space is shared between data and metadata.
The -EDQUOT can happen at the metadata reservation, so doing a NODATACOW
check after a qgroup reservation failure is not a solution.
[FIX]
To solve the problem, we don't return -EDQUOT directly, but every time
we get -EDQUOT, we try to flush qgroup space:
- Flush all inodes of the root
NODATACOW writes will free the qgroup reserved space at run_delalloc_range().
However, we don't have the infrastructure to only flush NODATACOW
inodes, so here we flush all inodes anyway.
- Wait for ordered extents
This would convert the preallocated metadata space into per-trans
metadata, which can be freed in later transaction commit.
- Commit transaction
This will free all per-trans metadata space.
Also we don't want to trigger flush multiple times, so here we introduce
a per-root wait list and a new root status, to ensure only one thread
starts the flushing.
Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
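The "flush, then retry once" idea from this message can be sketched in a single-threaded model. Everything here is an assumption made for illustration: struct demo_qgroup_root and the demo_* functions are hypothetical, the flush is reduced to releasing a counter, and where the real code waits on the per-root wait list this model only notes it in a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-root accounting, reduced to a few counters. */
struct demo_qgroup_root {
	long limit;       /* quota limit */
	long reserved;    /* space currently reserved */
	long flushable;   /* part of "reserved" that a flush would release */
	bool flushing;    /* models the "flush in progress" root status */
	int flush_count;  /* how many flushes were started */
};

static int demo_try_reserve(struct demo_qgroup_root *root, long bytes)
{
	if (root->reserved + bytes > root->limit)
		return -EDQUOT;
	root->reserved += bytes;
	return 0;
}

/* Stands in for: flush inodes, wait for ordered extents, commit. */
static void demo_flush(struct demo_qgroup_root *root)
{
	root->reserved -= root->flushable;
	root->flushable = 0;
	root->flush_count++;
}

static int demo_reserve(struct demo_qgroup_root *root, long bytes)
{
	int ret = demo_try_reserve(root, bytes);

	if (ret != -EDQUOT)
		return ret;

	/*
	 * Only one thread starts the flush. A concurrent caller would wait
	 * for the flusher to finish instead; that wait is not modeled here.
	 */
	if (!root->flushing) {
		root->flushing = true;
		demo_flush(root);
		root->flushing = false;
	}
	/* Retry once; a second -EDQUOT is returned to the caller. */
	return demo_try_reserve(root, bytes);
}
```

The caller only sees -EDQUOT when the reservation still does not fit after the flush has released everything it can.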
2020-07-13 18:50:48 +08:00
	init_waitqueue_head(&root->qgroup_flush_wait);
2009-01-21 12:54:03 -05:00
	init_waitqueue_head(&root->log_writer_wait);
	init_waitqueue_head(&root->log_commit_wait[0]);
	init_waitqueue_head(&root->log_commit_wait[1]);
2014-02-20 18:08:58 +08:00
	INIT_LIST_HEAD(&root->log_ctxs[0]);
	INIT_LIST_HEAD(&root->log_ctxs[1]);
2009-01-21 12:54:03 -05:00
	atomic_set(&root->log_commit[0], 0);
	atomic_set(&root->log_commit[1], 0);
	atomic_set(&root->log_writers, 0);
2012-09-06 04:04:27 -06:00
	atomic_set(&root->log_batch, 0);
2017-03-03 10:55:18 +02:00
	refcount_set(&root->refs, 1);
Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space
Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting") forced
nocow writes to fallback to COW, during writeback, when a snapshot is
created. This resulted in writes made before creating the snapshot to
unexpectedly fail with ENOSPC during writeback when success (0) was
returned to user space through the write system call.
The steps leading to this problem are:
1. When it's not possible to allocate data space for a write, the
buffered write path checks if a NOCOW write is possible. If it is,
it will not reserve space and success (0) is returned to user space.
2. Then when a snapshot is created, the root's will_be_snapshotted
atomic is incremented and writeback is triggered for all inodes that
belong to the root being snapshotted. Incrementing that atomic forces
all previous writes to fall back to COW during writeback (running
delalloc).
3. This results in the writeback for the inodes to fail and therefore
setting the ENOSPC error in their mappings, so that a subsequent
fsync on them will report the error to user space. So it's not a
completely silent data loss (since fsync will report ENOSPC) but it's
a very unexpected and undesirable behaviour, because if a clean
shutdown/unmount of the filesystem happens without previous calls to
fsync, it is expected to have the data present in the files after
mounting the filesystem again.
So fix this by adding a new atomic named snapshot_force_cow to the
root structure which prevents this behaviour and works the following way:
1. It is incremented when we start to create a snapshot after triggering
writeback and before waiting for writeback to finish.
2. This new atomic is now what is used by writeback (running delalloc)
to decide whether we need to fallback to COW or not. Because we
incremented this new atomic after triggering writeback in the
snapshot creation ioctl, we ensure that all buffered writes that
happened before snapshot creation will succeed and not fallback to
COW (which would make them fail with ENOSPC).
3. The existing atomic, will_be_snapshotted, is kept because it is used
to force new buffered writes, that start after we started
snapshotting, to reserve data space even when NOCOW is possible.
This makes these writes fail early with ENOSPC when there's no
available space to allocate, preventing the unexpected behaviour of
writeback later failing with ENOSPC due to a fallback to COW mode.
Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
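The division of labour between the two atomics can be sketched with C11 atomics. This is a minimal model under stated assumptions: struct demo_snap_root and the demo_* helpers are hypothetical names, and the real write and writeback paths check far more than these two counters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reduction of the root to the two counters discussed above. */
struct demo_snap_root {
	atomic_int will_be_snapshotted;
	atomic_int snapshot_force_cow;
};

/* Snapshot creation starts: new writes must reserve data space from now on. */
static void demo_snapshot_begin(struct demo_snap_root *root)
{
	atomic_fetch_add(&root->will_be_snapshotted, 1);
}

/* Called after writeback was triggered, before waiting for it to finish. */
static void demo_snapshot_writeback_triggered(struct demo_snap_root *root)
{
	atomic_fetch_add(&root->snapshot_force_cow, 1);
}

/* Write path: may a new buffered write skip the data space reservation? */
static bool demo_write_may_skip_reservation(struct demo_snap_root *root,
					    bool nocow_possible)
{
	return nocow_possible &&
	       atomic_load(&root->will_be_snapshotted) == 0;
}

/* Writeback path: must a NOCOW-capable range fall back to COW? */
static bool demo_writeback_must_cow(struct demo_snap_root *root)
{
	return atomic_load(&root->snapshot_force_cow) != 0;
}
```

Between demo_snapshot_begin() and demo_snapshot_writeback_triggered(), new writes already reserve space, while writeback of earlier NOCOW writes is not yet forced to COW. That is the window the fix opens.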
2018-08-06 10:30:30 +08:00
	atomic_set(&root->snapshot_force_cow, 0);
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-03 10:28:12 -07:00
	atomic_set(&root->nr_swapfiles, 0);
2023-10-04 11:38:49 +01:00
	btrfs_set_root_log_transid(root, 0);
2014-02-20 18:08:59 +08:00
	root->log_transid_committed = -1;
2023-10-04 11:38:48 +01:00
	btrfs_set_root_last_log_commit(root, 0);
2021-11-05 16:45:44 -04:00
	root->anon_dev = 0;
btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents
When we have extents shared amongst different inodes in the same subvolume,
if we fsync them in parallel we can end up with checksum items in the log
tree that represent ranges which overlap.
For example, consider we have inodes A and B, both sharing an extent that
covers the logical range from X to X + 64KiB:
1) Task A starts an fsync on inode A;
2) Task B starts an fsync on inode B;
3) Task A calls btrfs_csum_file_blocks(), and the first search in the
log tree, through btrfs_lookup_csum(), returns -EFBIG because it
finds an existing checksum item that covers the range from X - 64KiB
to X;
4) Task A checks that the checksum item has not reached the maximum
possible size (MAX_CSUM_ITEMS) and then releases the search path
before it does another path search for insertion (through a direct
call to btrfs_search_slot());
5) As soon as task A releases the path and before it does the search
for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG
too, because there is an existing checksum item that has an end
offset that matches the start offset (X) of the checksum range we want
to log;
6) Task B releases the path;
7) Task A does the path search for insertion (through btrfs_search_slot())
and then verifies that the checksum item that ends at offset X still
exists and extends its size to insert the checksums for the range from
X to X + 64KiB;
8) Task A releases the path and returns from btrfs_csum_file_blocks(),
having inserted the checksums into an existing checksum item that got
its size extended. At this point we have one checksum item in the log
tree that covers the logical range from X - 64KiB to X + 64KiB;
9) Task B now does a search for insertion using btrfs_search_slot() too,
but it finds that the previous checksum item no longer ends at the
offset X, it now ends at an offset of X + 64KiB, so it leaves that item
untouched.
Then it releases the path and calls btrfs_insert_empty_item()
that inserts a checksum item with a key offset corresponding to X and
a size for inserting a single checksum (4 bytes in case of crc32c).
Subsequent iterations end up extending this new checksum item so that
it contains the checksums for the range from X to X + 64KiB.
So after task B returns from btrfs_csum_file_blocks() we end up with
two checksum items in the log tree that have overlapping ranges, one
for the range from X - 64KiB to X + 64KiB, and another for the range
from X to X + 64KiB.
Having checksum items that represent ranges which overlap, regardless of
being in the log tree or in the checksums tree, can lead to problems where
checksums for a file range end up not being found. This type of problem
has happened a few times in the past and the following commits fixed them
and explain in detail why having checksum items with overlapping ranges is
problematic:
27b9a8122ff71a "Btrfs: fix csum tree corruption, duplicate and outdated checksums"
b84b8390d6009c "Btrfs: fix file read corruption after extent cloning and fsync"
40e046acbd2f36 "Btrfs: fix missing data checksums after replaying a log tree"
Since this specific instance of the problem can only happen when logging
inodes, because it is the only case where concurrent attempts to insert
checksums for the same range can happen, fix the issue by using an extent
io tree as a range lock to serialize checksum insertion during inode
logging.
This issue could often be reproduced by the test case generic/457 from
fstests. When it happens it produces the following trace:
BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item
BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610
BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884
item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4
item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4
item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4
item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4
item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4
item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4
(...)
BTRFS error (device dm-0): block=30625792 write time tree block corruption detected
------------[ cut here ]------------
WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs]
Modules linked in: btrfs dm_thin_pool ...
CPU: 1 PID: 15884 Comm: fsx Tainted: G W 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs]
Code: c7 c7 ...
RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001
RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0
R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000
FS: 00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btree_submit_bio_hook+0x67/0xc0 [btrfs]
submit_one_bio+0x31/0x50 [btrfs]
btree_write_cache_pages+0x2db/0x4b0 [btrfs]
? __filemap_fdatawrite_range+0xb1/0x110
do_writepages+0x23/0x80
__filemap_fdatawrite_range+0xd2/0x110
btrfs_write_marked_extents+0x15e/0x180 [btrfs]
btrfs_sync_log+0x206/0x10a0 [btrfs]
? kmem_cache_free+0x315/0x3b0
? btrfs_log_inode+0x1e8/0xf90 [btrfs]
? __mutex_unlock_slowpath+0x45/0x2a0
? lockref_put_or_lock+0x9/0x30
? dput+0x2d/0x580
? dput+0xb5/0x580
? btrfs_sync_file+0x464/0x4d0 [btrfs]
btrfs_sync_file+0x464/0x4d0 [btrfs]
do_fsync+0x38/0x60
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb41953a6d0
Code: 48 3d ...
RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0
RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003
RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009
R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060
R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace d543fc76f5ad7fd8 ]---
In that trace the tree checker detected the overlapping checksum items at
the time when we triggered writeback for the log tree when syncing the
log.
Another trace that can happen is due to BUG_ON() when deleting checksum
items while logging an inode:
BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584)
BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610
BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473
item 0 key (257 1 0) itemoff 16123 itemsize 160
inode generation 7 size 262144 mode 100600
item 1 key (257 12 256) itemoff 16103 itemsize 20
item 2 key (257 108 0) itemoff 16050 itemsize 53
extent data disk bytenr 13631488 nr 4096
extent data offset 0 nr 131072 ram 131072
(...)
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:3153!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs]
Code: 0f b6 ...
RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001
RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001
R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08
R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace
FS: 00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_del_csums+0x2f4/0x540 [btrfs]
copy_items+0x4b5/0x560 [btrfs]
btrfs_log_inode+0x910/0xf90 [btrfs]
btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
? dget_parent+0x5/0x370
btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
btrfs_sync_file+0x42b/0x4d0 [btrfs]
__x64_sys_msync+0x199/0x200
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fe586c65760
Code: 00 f7 ...
RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760
RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000
RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61
R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1
R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420
Modules linked in: dm_log_writes ...
---[ end trace c92a7f447a8515f5 ]---
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
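The overlap condition reported by the tree checker in the first trace can be expressed as a small range computation. The sketch below assumes a 4096-byte sector size and 4-byte crc32c checksums, which matches the key offsets and item sizes shown in the leaf dump; demo_csum_item_end() and demo_csum_items_overlap() are hypothetical helpers, not the tree checker's own functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * End of the logical range covered by a checksum item: one checksum of
 * csum_size bytes is stored per sector, starting at the key offset.
 */
static uint64_t demo_csum_item_end(uint64_t key_offset, uint32_t item_size,
				   uint32_t csum_size, uint32_t sectorsize)
{
	assert(csum_size > 0 && item_size % csum_size == 0);
	return key_offset + (uint64_t)(item_size / csum_size) * sectorsize;
}

/* Two adjacent items overlap if the first one ends past the second's start. */
static bool demo_csum_items_overlap(uint64_t prev_offset,
				    uint32_t prev_item_size,
				    uint64_t next_offset,
				    uint32_t csum_size, uint32_t sectorsize)
{
	return demo_csum_item_end(prev_offset, prev_item_size, csum_size,
				  sectorsize) > next_offset;
}
```

For items 0 and 1 of the dumped leaf, the first item ends exactly where the second begins, so they do not overlap. An item covering X - 64KiB to X + 64KiB followed by one starting at X does overlap, which is the state the race leaves behind.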
2020-05-18 12:14:50 +01:00
	if (!dummy) {
2019-03-01 10:47:59 +08:00
		extent_io_tree_init(fs_info, &root->dirty_log_pages,
2022-10-28 02:47:06 +02:00
				    IO_TREE_ROOT_DIRTY_LOG_PAGES);
btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents
When we have extents shared amongst different inodes in the same subvolume,
if we fsync them in parallel we can end up with checksum items in the log
tree that represent ranges which overlap.
For example, consider we have inodes A and B, both sharing an extent that
covers the logical range from X to X + 64KiB:
1) Task A starts an fsync on inode A;
2) Task B starts an fsync on inode B;
3) Task A calls btrfs_csum_file_blocks(), and the first search in the
log tree, through btrfs_lookup_csum(), returns -EFBIG because it
finds an existing checksum item that covers the range from X - 64KiB
to X;
4) Task A checks that the checksum item has not reached the maximum
possible size (MAX_CSUM_ITEMS) and then releases the search path
before it does another path search for insertion (through a direct
call to btrfs_search_slot());
5) As soon as task A releases the path and before it does the search
for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG
too, because there is an existing checksum item that has an end
offset that matches the start offset (X) of the checksum range we want
to log;
6) Task B releases the path;
7) Task A does the path search for insertion (through btrfs_search_slot())
and then verifies that the checksum item that ends at offset X still
exists and extends its size to insert the checksums for the range from
X to X + 64KiB;
8) Task A releases the path and returns from btrfs_csum_file_blocks(),
having inserted the checksums into an existing checksum item that got
its size extended. At this point we have one checksum item in the log
tree that covers the logical range from X - 64KiB to X + 64KiB;
9) Task B now does a search for insertion using btrfs_search_slot() too,
but it finds that the previous checksum item no longer ends at the
offset X, it now ends at offset X + 64KiB, so it leaves that item
untouched.
Then it releases the path and calls btrfs_insert_empty_item()
that inserts a checksum item with a key offset corresponding to X and
a size for inserting a single checksum (4 bytes in case of crc32c).
Subsequent iterations end up extending this new checksum item so that
it contains the checksums for the range from X to X + 64KiB.
So after task B returns from btrfs_csum_file_blocks() we end up with
two checksum items in the log tree that have overlapping ranges, one
for the range from X - 64KiB to X + 64KiB, and another for the range
from X to X + 64KiB.
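The overlap described above can be checked with simple arithmetic. The following is a minimal sketch, assuming a simplified item layout: struct csum_item, csum_item_end() and csum_items_overlap() are hypothetical names for illustration, not the kernel's tree-checker code. It assumes one checksum per sector, so an item covers (item_size / csum_size) sectors starting at its key offset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a checksum item; not the on-disk layout. */
struct csum_item {
	uint64_t start;		/* key offset: first logical byte covered */
	uint32_t item_size;	/* bytes of checksum data stored in the item */
};

/* One past the last logical byte covered by the item. */
static uint64_t csum_item_end(const struct csum_item *it,
			      uint32_t csum_size, uint32_t sectorsize)
{
	return it->start + (uint64_t)(it->item_size / csum_size) * sectorsize;
}

/* True if 'prev' reaches past the start of 'next' ('prev' sorts first). */
static bool csum_items_overlap(const struct csum_item *prev,
			       const struct csum_item *next,
			       uint32_t csum_size, uint32_t sectorsize)
{
	return csum_item_end(prev, csum_size, sectorsize) > next->start;
}
```

With 4-byte checksums and 4KiB sectors, an item starting at X - 64KiB with 128 bytes of checksums ends at X + 64KiB, which is past the start (X) of the second item, so the pair is reported as overlapping.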
Having checksum items that represent ranges which overlap, regardless of
being in the log tree or in the checksums tree, can lead to problems where
checksums for a file range end up not being found. This type of problem
has happened a few times in the past and the following commits fixed them
and explain in detail why having checksum items with overlapping ranges is
problematic:
27b9a8122ff71a "Btrfs: fix csum tree corruption, duplicate and outdated checksums"
b84b8390d6009c "Btrfs: fix file read corruption after extent cloning and fsync"
40e046acbd2f36 "Btrfs: fix missing data checksums after replaying a log tree"
Since this specific instance of the problem can only happen when logging
inodes, because it is the only case where concurrent attempts to insert
checksums for the same range can happen, fix the issue by using an extent
io tree as a range lock to serialize checksum insertion during inode
logging.
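The idea of a range lock can be modelled in userspace as follows. This is only a sketch under stated assumptions: struct range_lock, range_trylock() and range_unlock() are hypothetical names, the table is a fixed array rather than an extent io tree, and the model is single-threaded (a real lock would protect its own state and sleep until the range is free instead of failing).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HELD 8

/* Hypothetical stand-in for a range lock; ranges are inclusive [start, end]. */
struct range {
	uint64_t start, end;
	bool used;
};

struct range_lock {
	struct range held[MAX_HELD];
};

/* Try to take [start, end]; fail if it overlaps a range already held. */
static bool range_trylock(struct range_lock *rl, uint64_t start, uint64_t end)
{
	struct range *free_slot = NULL;

	for (size_t i = 0; i < MAX_HELD; i++) {
		struct range *r = &rl->held[i];

		if (!r->used) {
			if (!free_slot)
				free_slot = r;
			continue;
		}
		if (start <= r->end && r->start <= end)
			return false;	/* overlap: caller must wait and retry */
	}
	if (!free_slot)
		return false;		/* table full */
	*free_slot = (struct range){ .start = start, .end = end, .used = true };
	return true;
}

/* Release a range previously taken with exactly the same bounds. */
static void range_unlock(struct range_lock *rl, uint64_t start, uint64_t end)
{
	for (size_t i = 0; i < MAX_HELD; i++) {
		struct range *r = &rl->held[i];

		if (r->used && r->start == start && r->end == end) {
			r->used = false;
			return;
		}
	}
}
```

In the scenario above, task A would take the range for X to X + 64KiB before its first lookup and hold it until the checksums are inserted, so task B's lookup for the same range could not run between task A's two searches.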
This issue could often be reproduced by the test case generic/457 from
fstests. When it happens it produces the following trace:
BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item
BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610
BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884
item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4
item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4
item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4
item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4
item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4
item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4
(...)
BTRFS error (device dm-0): block=30625792 write time tree block corruption detected
------------[ cut here ]------------
WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs]
Modules linked in: btrfs dm_thin_pool ...
CPU: 1 PID: 15884 Comm: fsx Tainted: G W 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs]
Code: c7 c7 ...
RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001
RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0
R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000
FS: 00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btree_submit_bio_hook+0x67/0xc0 [btrfs]
submit_one_bio+0x31/0x50 [btrfs]
btree_write_cache_pages+0x2db/0x4b0 [btrfs]
? __filemap_fdatawrite_range+0xb1/0x110
do_writepages+0x23/0x80
__filemap_fdatawrite_range+0xd2/0x110
btrfs_write_marked_extents+0x15e/0x180 [btrfs]
btrfs_sync_log+0x206/0x10a0 [btrfs]
? kmem_cache_free+0x315/0x3b0
? btrfs_log_inode+0x1e8/0xf90 [btrfs]
? __mutex_unlock_slowpath+0x45/0x2a0
? lockref_put_or_lock+0x9/0x30
? dput+0x2d/0x580
? dput+0xb5/0x580
? btrfs_sync_file+0x464/0x4d0 [btrfs]
btrfs_sync_file+0x464/0x4d0 [btrfs]
do_fsync+0x38/0x60
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb41953a6d0
Code: 48 3d ...
RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0
RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003
RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009
R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060
R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace d543fc76f5ad7fd8 ]---
In that trace the tree checker detected the overlapping checksum items at
the time when we triggered writeback for the log tree when syncing the
log.
Another trace that can happen is due to BUG_ON() when deleting checksum
items while logging an inode:
BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584)
BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610
BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473
item 0 key (257 1 0) itemoff 16123 itemsize 160
inode generation 7 size 262144 mode 100600
item 1 key (257 12 256) itemoff 16103 itemsize 20
item 2 key (257 108 0) itemoff 16050 itemsize 53
extent data disk bytenr 13631488 nr 4096
extent data offset 0 nr 131072 ram 131072
(...)
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:3153!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs]
Code: 0f b6 ...
RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001
RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001
R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08
R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace
FS: 00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_del_csums+0x2f4/0x540 [btrfs]
copy_items+0x4b5/0x560 [btrfs]
btrfs_log_inode+0x910/0xf90 [btrfs]
btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
? dget_parent+0x5/0x370
btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
btrfs_sync_file+0x42b/0x4d0 [btrfs]
__x64_sys_msync+0x199/0x200
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fe586c65760
Code: 00 f7 ...
RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760
RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000
RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61
R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1
R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420
Modules linked in: dm_log_writes ...
---[ end trace c92a7f447a8515f5 ]---
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-18 12:14:50 +01:00
extent_io_tree_init(fs_info, &root->log_csum_range,
2022-10-28 02:47:06 +02:00
IO_TREE_LOG_CSUM_RANGE);
2020-05-18 12:14:50 +01:00
}
2008-07-28 15:32:51 -04:00
2012-12-07 09:28:54 +00:00
spin_lock_init(&root->root_item_lock);
btrfs: qgroup: Introduce per-root swapped blocks infrastructure
To allow delayed subtree swap rescan, btrfs needs to record per-root
information about which tree blocks get swapped. This patch introduces
the required infrastructure.
The designed workflow will be:
1) Record the subtree root block that gets swapped.
During subtree swap:
O = Old tree blocks
N = New tree blocks
reloc tree subvolume tree X
Root Root
/ \ / \
NA OB OA OB
/ | | \ / | | \
NC ND OE OF OC OD OE OF
In this case, NA and OA are going to be swapped, record (NA, OA) into
subvolume tree X.
2) After subtree swap.
reloc tree subvolume tree X
Root Root
/ \ / \
OA OB NA OB
/ | | \ / | | \
OC OD OE OF NC ND OE OF
3a) COW happens for OB
If we are going to COW tree block OB, we check OB's bytenr against
tree X's swapped_blocks structure.
If it doesn't fit any, nothing will happen.
3b) COW happens for NA
Check NA's bytenr against tree X's swapped_blocks, and get a hit.
Then we do subtree scan on both subtrees OA and NA.
Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
Then no matter what we do to subvolume tree X, qgroup numbers will
still be correct.
Then NA's record gets removed from X's swapped_blocks.
4) Transaction commit
Any record in X's swapped_blocks gets removed, since there is no
modification to swapped subtrees, no need to trigger heavy qgroup
subtree rescan for them.
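The workflow above can be sketched as a small userspace model. This is an illustration only: struct swapped_blocks here is a fixed array with hypothetical helpers record_swap(), cow_check_swapped() and commit_clear(), not the structure this patch adds to btrfs_root.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SWAPPED 16

/* Hypothetical record of one swapped pair, keyed by the subvolume-side block. */
struct swapped_block {
	uint64_t subvol_bytenr;	/* block now in the subvolume tree (NA) */
	uint64_t reloc_bytenr;	/* block now in the reloc tree (OA) */
	bool used;
};

struct swapped_blocks {
	struct swapped_block rec[MAX_SWAPPED];
};

/* Step 1: remember a swapped pair. Returns false if the table is full. */
static bool record_swap(struct swapped_blocks *sb, uint64_t subvol_bytenr,
			uint64_t reloc_bytenr)
{
	for (size_t i = 0; i < MAX_SWAPPED; i++) {
		if (!sb->rec[i].used) {
			sb->rec[i] = (struct swapped_block){
				.subvol_bytenr = subvol_bytenr,
				.reloc_bytenr = reloc_bytenr,
				.used = true,
			};
			return true;
		}
	}
	return false;
}

/*
 * Step 3: on COW of 'bytenr', look for a record. On a hit, hand back the
 * paired block (the caller would scan both subtrees) and drop the record.
 */
static bool cow_check_swapped(struct swapped_blocks *sb, uint64_t bytenr,
			      uint64_t *reloc_bytenr)
{
	for (size_t i = 0; i < MAX_SWAPPED; i++) {
		if (sb->rec[i].used && sb->rec[i].subvol_bytenr == bytenr) {
			*reloc_bytenr = sb->rec[i].reloc_bytenr;
			sb->rec[i].used = false;
			return true;
		}
	}
	return false;
}

/* Step 4: at transaction commit, drop every remaining record. */
static void commit_clear(struct swapped_blocks *sb)
{
	for (size_t i = 0; i < MAX_SWAPPED; i++)
		sb->rec[i].used = false;
}
```

A COW of OB misses and does nothing, a COW of NA hits once and removes its record, and any record still present at commit is simply discarded.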
This will introduce 128 bytes overhead for each btrfs_root even qgroup
is not enabled. This is to reduce memory allocations and potential
failures.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-23 15:15:16 +08:00
btrfs_qgroup_init_swapped_blocks(&root->swapped_blocks);
2020-01-24 09:33:00 -05:00
#ifdef CONFIG_BTRFS_DEBUG
INIT_LIST_HEAD(&root->leak_list);
2022-07-15 13:59:21 +02:00
spin_lock(&fs_info->fs_roots_radix_lock);
2020-01-24 09:33:00 -05:00
list_add_tail(&root->leak_list, &fs_info->allocated_roots);
2022-07-15 13:59:21 +02:00
spin_unlock(&fs_info->fs_roots_radix_lock);
2020-01-24 09:33:00 -05:00
#endif
2007-03-13 16:47:54 -04:00
}
2016-02-11 11:01:55 +01:00
static struct btrfs_root *btrfs_alloc_root(struct btrfs_fs_info *fs_info,
2020-01-24 09:32:18 -05:00
u64 objectid, gfp_t flags)
2011-11-17 00:46:16 -05:00
{
2016-02-11 11:01:55 +01:00
struct btrfs_root *root = kzalloc(sizeof(*root), flags);
2011-11-17 00:46:16 -05:00
if (root)
2020-01-24 09:32:18 -05:00
__setup_root(root, fs_info, objectid);
2011-11-17 00:46:16 -05:00
return root;
}
2013-09-19 16:07:01 -04:00
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
/* Should only be used by the testing infrastructure */
2016-06-15 09:22:56 -04:00
struct btrfs_root *btrfs_alloc_dummy_root(struct btrfs_fs_info *fs_info)
2013-09-19 16:07:01 -04:00
{
struct btrfs_root *root;
2016-06-20 14:14:09 -04:00
if (!fs_info)
return ERR_PTR(-EINVAL);
2020-01-24 09:32:18 -05:00
root = btrfs_alloc_root(fs_info, BTRFS_ROOT_TREE_OBJECTID, GFP_KERNEL);
2013-09-19 16:07:01 -04:00
if (!root)
return ERR_PTR(-ENOMEM);
2016-06-15 09:22:56 -04:00
2016-06-01 19:18:25 +08:00
/* We don't use the stripesize in selftest, set it as sectorsize */
2014-05-07 17:06:09 -04:00
root->alloc_bytenr = 0;
2013-09-19 16:07:01 -04:00
return root;
}
#endif
2021-11-05 16:45:51 -04:00
static int global_root_cmp(struct rb_node *a_node, const struct rb_node *b_node)
{
const struct btrfs_root *a = rb_entry(a_node, struct btrfs_root, rb_node);
const struct btrfs_root *b = rb_entry(b_node, struct btrfs_root, rb_node);
return btrfs_comp_cpu_keys(&a->root_key, &b->root_key);
}
static int global_root_key_cmp(const void *k, const struct rb_node *node)
{
const struct btrfs_key *key = k;
const struct btrfs_root *root = rb_entry(node, struct btrfs_root, rb_node);
return btrfs_comp_cpu_keys(key, &root->root_key);
}
int btrfs_global_root_insert(struct btrfs_root *root)
{
struct btrfs_fs_info *fs_info = root->fs_info;
struct rb_node *tmp;
btrfs: do not ASSERT() on duplicated global roots
[BUG]
Syzbot reports a reproducible ASSERT() when using rescue=usebackuproot
mount option on a corrupted fs.
The full report can be found here:
https://syzkaller.appspot.com/bug?extid=c4614eae20a166c25bf0
BTRFS error (device loop0: state C): failed to load root csum
assertion failed: !tmp, in fs/btrfs/disk-io.c:1103
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3664!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 3608 Comm: syz-executor356 Not tainted 6.0.0-rc7-syzkaller-00029-g3800a713b607 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
RIP: 0010:assertfail+0x1a/0x1c fs/btrfs/ctree.h:3663
RSP: 0018:ffffc90003aaf250 EFLAGS: 00010246
RAX: 0000000000000032 RBX: 0000000000000000 RCX: f21c13f886638400
RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
RBP: ffff888021c640a0 R08: ffffffff816bd38d R09: ffffed10173667f1
R10: ffffed10173667f1 R11: 1ffff110173667f0 R12: dffffc0000000000
R13: ffff8880229c21f7 R14: ffff888021c64060 R15: ffff8880226c0000
FS: 0000555556a73300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055a2637d7a00 CR3: 00000000709c4000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
btrfs_global_root_insert+0x1a7/0x1b0 fs/btrfs/disk-io.c:1103
load_global_roots_objectid+0x482/0x8c0 fs/btrfs/disk-io.c:2467
load_global_roots fs/btrfs/disk-io.c:2501 [inline]
btrfs_read_roots fs/btrfs/disk-io.c:2528 [inline]
init_tree_roots+0xccb/0x203c fs/btrfs/disk-io.c:2939
open_ctree+0x1e53/0x33df fs/btrfs/disk-io.c:3574
btrfs_fill_super+0x1c6/0x2d0 fs/btrfs/super.c:1456
btrfs_mount_root+0x885/0x9a0 fs/btrfs/super.c:1824
legacy_get_tree+0xea/0x180 fs/fs_context.c:610
vfs_get_tree+0x88/0x270 fs/super.c:1530
fc_mount fs/namespace.c:1043 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1073
btrfs_mount+0x3d3/0xbb0 fs/btrfs/super.c:1884
[CAUSE]
Since the introduction of global roots, we handle
csum/extent/free-space-tree roots as global roots, even if no
extent-tree-v2 feature is enabled.
So for regular csum/extent/fst roots, we load them into
fs_info::global_root_tree rb tree.
And we should not expect any conflicts in that rb tree, thus we have an
ASSERT() inside btrfs_global_root_insert().
But rescue=usebackuproot can break that assumption, as we will try to
load those trees again and again as long as we have bad roots and have
backup root slots remaining.
So in that case we can have conflicting roots in the rb tree, which
triggers the ASSERT() crash.
[FIX]
We can safely remove that ASSERT(), as the caller will properly put the
offending root.
To make further debugging easier, also add two explicit error messages:
- Error message for conflicting global roots
- Error message when using backup roots slot
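The behaviour described in [FIX] amounts to reporting a duplicate instead of asserting on it. The following is a minimal userspace sketch of that contract, assuming a flat array in place of the rb tree: struct root_key, struct root_index and root_index_insert() are hypothetical names for illustration, not the kernel code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ROOTS 8

/* Hypothetical stand-ins for a root key and the global root index. */
struct root_key {
	uint64_t objectid, offset;
};

struct root_index {
	struct root_key keys[MAX_ROOTS];
	size_t nr;
};

/*
 * Add 'key' unless an equal key is already present. A duplicate is an
 * ordinary error for the caller to handle, not a reason to crash.
 */
static int root_index_insert(struct root_index *idx, struct root_key key)
{
	for (size_t i = 0; i < idx->nr; i++) {
		if (idx->keys[i].objectid == key.objectid &&
		    idx->keys[i].offset == key.offset)
			return -EEXIST;
	}
	if (idx->nr == MAX_ROOTS)
		return -ENOSPC;
	idx->keys[idx->nr++] = key;
	return 0;
}
```

A caller that gets -EEXIST keeps ownership of the object it tried to insert and is responsible for releasing it, which matches the statement that the caller puts the offending root.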
Reported-by: syzbot+a694851c6ab28cbcfb9c@syzkaller.appspotmail.com
Fixes: abed4aaae4f7 ("btrfs: track the csum, extent, and free space trees in a rb tree")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-11 08:09:13 +08:00
int ret = 0;
2021-11-05 16:45:51 -04:00
write_lock(&fs_info->global_root_lock);
tmp = rb_find_add(&root->rb_node, &fs_info->global_root_tree, global_root_cmp);
write_unlock(&fs_info->global_root_lock);
2023-06-11 08:09:13 +08:00
if (tmp) {
ret = -EEXIST;
btrfs_warn(fs_info, "global root %llu %llu already exists",
2024-04-15 16:16:23 -04:00
btrfs_root_id(root), root->root_key.offset);
2023-06-11 08:09:13 +08:00
}
return ret ;
2021-11-05 16:45:51 -04:00
}
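The duplicate handling described in the commit message above can be sketched in user space as follows. This is a minimal model, not the kernel implementation: struct demo_root, demo_key_cmp() and demo_global_root_insert() are hypothetical names, and a plain unbalanced binary search tree stands in for the kernel rb tree. The point it shows is that a conflicting key is reported to the caller as -EEXIST rather than tripping an assertion.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct btrfs_root; only the key fields matter. */
struct demo_root {
	uint64_t objectid;
	uint64_t offset;
	struct demo_root *left, *right;
};

/* Order by objectid first, then by offset. */
static int demo_key_cmp(const struct demo_root *a, const struct demo_root *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Insert into an unbalanced binary search tree (assumption: the kernel uses
 * an rb tree under a lock instead). A duplicate key returns -EEXIST and
 * leaves the tree untouched, so the caller can drop the offending root.
 */
static int demo_global_root_insert(struct demo_root **tree,
				   struct demo_root *root)
{
	assert(tree && root);
	while (*tree) {
		int cmp = demo_key_cmp(root, *tree);

		if (cmp == 0)
			return -EEXIST;
		tree = cmp < 0 ? &(*tree)->left : &(*tree)->right;
	}
	root->left = root->right = NULL;
	*tree = root;
	return 0;
}
```

A caller that retries with a backup root would check the return value and put its reference on -EEXIST, which mirrors the behaviour the fix relies on.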
void btrfs_global_root_delete ( struct btrfs_root * root )
{
struct btrfs_fs_info * fs_info = root - > fs_info ;
write_lock ( & fs_info - > global_root_lock ) ;
rb_erase ( & root - > rb_node , & fs_info - > global_root_tree ) ;
write_unlock ( & fs_info - > global_root_lock ) ;
}
struct btrfs_root * btrfs_global_root ( struct btrfs_fs_info * fs_info ,
struct btrfs_key * key )
{
struct rb_node * node ;
struct btrfs_root * root = NULL ;
read_lock ( & fs_info - > global_root_lock ) ;
node = rb_find ( key , & fs_info - > global_root_tree , global_root_key_cmp ) ;
if ( node )
root = container_of ( node , struct btrfs_root , rb_node ) ;
read_unlock ( & fs_info - > global_root_lock ) ;
return root ;
}
2021-12-15 15:40:08 -05:00
static u64 btrfs_global_root_id ( struct btrfs_fs_info * fs_info , u64 bytenr )
{
struct btrfs_block_group * block_group ;
u64 ret ;
if ( ! btrfs_fs_incompat ( fs_info , EXTENT_TREE_V2 ) )
return 0 ;
if ( bytenr )
block_group = btrfs_lookup_block_group ( fs_info , bytenr ) ;
else
block_group = btrfs_lookup_first_block_group ( fs_info , bytenr ) ;
ASSERT ( block_group ) ;
if ( ! block_group )
return 0 ;
ret = block_group - > global_root_id ;
btrfs_put_block_group ( block_group ) ;
return ret ;
}
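The selection logic of btrfs_global_root_id() above can be modelled in user space roughly as below. This is a sketch under stated assumptions: struct demo_block_group, demo_lookup_block_group() and demo_global_root_id() are hypothetical, the block groups live in a sorted array rather than the kernel's block group cache, and the extent-tree-v2 feature check is reduced to a boolean argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified block group: a byte range plus its global root id. */
struct demo_block_group {
	uint64_t start;
	uint64_t length;
	uint64_t global_root_id;
};

/*
 * Return the block group containing bytenr, or the first block group when
 * bytenr is 0 (assumption: the array is sorted by start). NULL if none match.
 */
static const struct demo_block_group *
demo_lookup_block_group(const struct demo_block_group *groups, size_t count,
			uint64_t bytenr)
{
	for (size_t i = 0; i < count; i++) {
		if (bytenr == 0)
			return &groups[i];
		if (bytenr >= groups[i].start &&
		    bytenr - groups[i].start < groups[i].length)
			return &groups[i];
	}
	return NULL;
}

/*
 * Without extent-tree-v2 there is a single global root, so the id is 0.
 * With it, the id comes from the block group that owns bytenr.
 */
static uint64_t demo_global_root_id(bool extent_tree_v2,
				    const struct demo_block_group *groups,
				    size_t count, uint64_t bytenr)
{
	const struct demo_block_group *bg;

	assert(groups || count == 0);
	if (!extent_tree_v2)
		return 0;
	bg = demo_lookup_block_group(groups, count, bytenr);
	if (!bg)
		return 0;
	return bg->global_root_id;
}
```

The returned value plays the role of the key offset that btrfs_csum_root() and btrfs_extent_root() pass to the lookup.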
2021-11-05 16:45:51 -04:00
struct btrfs_root * btrfs_csum_root ( struct btrfs_fs_info * fs_info , u64 bytenr )
{
struct btrfs_key key = {
. objectid = BTRFS_CSUM_TREE_OBJECTID ,
. type = BTRFS_ROOT_ITEM_KEY ,
2021-12-15 15:40:08 -05:00
. offset = btrfs_global_root_id ( fs_info , bytenr ) ,
2021-11-05 16:45:51 -04:00
} ;
return btrfs_global_root ( fs_info , & key ) ;
}
struct btrfs_root * btrfs_extent_root ( struct btrfs_fs_info * fs_info , u64 bytenr )
{
struct btrfs_key key = {
. objectid = BTRFS_EXTENT_TREE_OBJECTID ,
. type = BTRFS_ROOT_ITEM_KEY ,
2021-12-15 15:40:08 -05:00
. offset = btrfs_global_root_id ( fs_info , bytenr ) ,
2021-11-05 16:45:51 -04:00
} ;
return btrfs_global_root ( fs_info , & key ) ;
}
2011-09-13 12:44:20 +02:00
struct btrfs_root * btrfs_create_tree ( struct btrfs_trans_handle * trans ,
u64 objectid )
{
2019-03-20 13:20:49 +01:00
struct btrfs_fs_info * fs_info = trans - > fs_info ;
2011-09-13 12:44:20 +02:00
struct extent_buffer * leaf ;
struct btrfs_root * tree_root = fs_info - > tree_root ;
struct btrfs_root * root ;
struct btrfs_key key ;
2018-12-13 21:16:45 +00:00
unsigned int nofs_flag ;
2011-09-13 12:44:20 +02:00
int ret = 0 ;
2018-12-13 21:16:45 +00:00
/*
* We're holding a transaction handle, so use a NOFS memory allocation
* context to avoid deadlock if reclaim happens.
*/
nofs_flag = memalloc_nofs_save ( ) ;
2020-01-24 09:32:18 -05:00
root = btrfs_alloc_root ( fs_info , objectid , GFP_KERNEL ) ;
2018-12-13 21:16:45 +00:00
memalloc_nofs_restore ( nofs_flag ) ;
2011-09-13 12:44:20 +02:00
if ( ! root )
return ERR_PTR ( - ENOMEM ) ;
root - > root_key . objectid = objectid ;
root - > root_key . type = BTRFS_ROOT_ITEM_KEY ;
root - > root_key . offset = 0 ;
2020-08-20 11:46:03 -04:00
leaf = btrfs_alloc_tree_block ( trans , root , 0 , objectid , NULL , 0 , 0 , 0 ,
btrfs: qgroup: track metadata relocation COW with simple quota
Relocation COWs metadata blocks in two cases for the reloc root:
- copying the subvolume root item when creating the reloc root
- copying a btree node when there is a COW during relocation
In both cases, the resulting btree node hits an abnormal code path with
respect to the owner field in its btrfs_header. It first creates the
root item for the new objectid, which populates the reloc root id, and
it is at this point that delayed refs are created.
Later, it fully copies the old node into the new node (including the
original owner field) which overwrites it. This results in a simple
quotas mismatch where we run the delayed ref for the reloc root which
has no simple quota effect (reloc root is not an fstree) but when we
ultimately delete the node, the owner is the real original fstree and we
do free the space.
To work around this without tampering with the behavior of relocation,
add a parameter to btrfs_alloc_tree_block that lets the relocation code
path specify a different owning root than the "operating" root (in this
case, owning root is the real root and the operating root is the reloc
root). These can naturally be plumbed into delayed refs that have the
same concept.
Note that this is a double count in some sense, but a relatively natural
one, as there are really two extents, and the old one will be deleted
soon. This is consistent with how data relocation extents are accounted
by simple quotas.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-21 17:37:23 -07:00
0 , BTRFS_NESTING_NORMAL ) ;
2011-09-13 12:44:20 +02:00
if ( IS_ERR ( leaf ) ) {
ret = PTR_ERR ( leaf ) ;
2013-03-21 04:32:32 +00:00
leaf = NULL ;
2022-10-07 18:33:35 +02:00
goto fail ;
2011-09-13 12:44:20 +02:00
}
root - > node = leaf ;
2023-09-12 13:04:29 +01:00
btrfs_mark_buffer_dirty ( trans , leaf ) ;
2011-09-13 12:44:20 +02:00
root - > commit_root = btrfs_root_node ( root ) ;
2014-04-02 19:51:05 +08:00
set_bit ( BTRFS_ROOT_TRACK_DIRTY , & root - > state ) ;
2011-09-13 12:44:20 +02:00
2020-09-15 21:00:04 +02:00
btrfs_set_root_flags ( & root - > root_item , 0 ) ;
btrfs_set_root_limit ( & root - > root_item , 0 ) ;
2011-09-13 12:44:20 +02:00
btrfs_set_root_bytenr ( & root - > root_item , leaf - > start ) ;
btrfs_set_root_generation ( & root - > root_item , trans - > transid ) ;
btrfs_set_root_level ( & root - > root_item , 0 ) ;
btrfs_set_root_refs ( & root - > root_item , 1 ) ;
btrfs_set_root_used ( & root - > root_item , leaf - > len ) ;
btrfs_set_root_last_snapshot ( & root - > root_item , 0 ) ;
btrfs_set_root_dirid ( & root - > root_item , 0 ) ;
2017-10-31 14:08:16 +08:00
if ( is_fstree ( objectid ) )
2020-02-24 17:37:51 +02:00
generate_random_guid ( root - > root_item . uuid ) ;
else
export_guid ( root - > root_item . uuid , & guid_null ) ;
2020-09-15 21:44:52 +02:00
btrfs_set_root_drop_level ( & root - > root_item , 0 ) ;
2011-09-13 12:44:20 +02:00
btrfs: fix lockdep warning when creating free space tree
A lock dependency loop exists between the root tree lock, the extent tree
lock, and the free space tree lock.
The root tree lock depends on the free space tree lock because
btrfs_create_tree holds the new tree's lock while adding it to the root
tree.
The extent tree lock depends on the root tree lock because during
umount, we write out space cache v1, which writes inodes in the root
tree, which results in holding the root tree lock while doing a lookup
in the extent tree.
Finally, the free space tree depends on the extent tree because
populate_free_space_tree holds a locked path in the extent tree and then
does a lookup in the free space tree to add the new item.
The simplest of the three to break is the one during tree creation: we
unlock the leaf before inserting the tree node into the root tree, which
fixes the lockdep warning.
[30.480136] ======================================================
[30.480830] WARNING: possible circular locking dependency detected
[30.481457] 5.9.0-rc8+ #76 Not tainted
[30.481897] ------------------------------------------------------
[30.482500] mount/520 is trying to acquire lock:
[30.483064] ffff9babebe03908 (btrfs-free-space-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
[30.484054]
but task is already holding lock:
[30.484637] ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
[30.485581]
which lock already depends on the new lock.
[30.486397]
the existing dependency chain (in reverse order) is:
[30.487205]
-> #2 (btrfs-extent-01#2){++++}-{3:3}:
[30.487825] down_read_nested+0x43/0x150
[30.488306] __btrfs_tree_read_lock+0x39/0x180
[30.488868] __btrfs_read_lock_root_node+0x3a/0x50
[30.489477] btrfs_search_slot+0x464/0x9b0
[30.490009] check_committed_ref+0x59/0x1d0
[30.490603] btrfs_cross_ref_exist+0x65/0xb0
[30.491108] run_delalloc_nocow+0x405/0x930
[30.491651] btrfs_run_delalloc_range+0x60/0x6b0
[30.492203] writepage_delalloc+0xd4/0x150
[30.492688] __extent_writepage+0x18d/0x3a0
[30.493199] extent_write_cache_pages+0x2af/0x450
[30.493743] extent_writepages+0x34/0x70
[30.494231] do_writepages+0x31/0xd0
[30.494642] __filemap_fdatawrite_range+0xad/0xe0
[30.495194] btrfs_fdatawrite_range+0x1b/0x50
[30.495677] __btrfs_write_out_cache+0x40d/0x460
[30.496227] btrfs_write_out_cache+0x8b/0x110
[30.496716] btrfs_start_dirty_block_groups+0x211/0x4e0
[30.497317] btrfs_commit_transaction+0xc0/0xba0
[30.497861] sync_filesystem+0x71/0x90
[30.498303] btrfs_remount+0x81/0x433
[30.498767] reconfigure_super+0x9f/0x210
[30.499261] path_mount+0x9d1/0xa30
[30.499722] do_mount+0x55/0x70
[30.500158] __x64_sys_mount+0xc4/0xe0
[30.500616] do_syscall_64+0x33/0x40
[30.501091] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[30.501629]
-> #1 (btrfs-root-00){++++}-{3:3}:
[30.502241] down_read_nested+0x43/0x150
[30.502727] __btrfs_tree_read_lock+0x39/0x180
[30.503291] __btrfs_read_lock_root_node+0x3a/0x50
[30.503903] btrfs_search_slot+0x464/0x9b0
[30.504405] btrfs_insert_empty_items+0x60/0xa0
[30.504973] btrfs_insert_item+0x60/0xd0
[30.505412] btrfs_create_tree+0x1b6/0x210
[30.505913] btrfs_create_free_space_tree+0x54/0x110
[30.506460] btrfs_mount_rw+0x15d/0x20f
[30.506937] btrfs_remount+0x356/0x433
[30.507369] reconfigure_super+0x9f/0x210
[30.507868] path_mount+0x9d1/0xa30
[30.508264] do_mount+0x55/0x70
[30.508668] __x64_sys_mount+0xc4/0xe0
[30.509186] do_syscall_64+0x33/0x40
[30.509652] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[30.510271]
-> #0 (btrfs-free-space-00){++++}-{3:3}:
[30.510972] __lock_acquire+0x11ad/0x1b60
[30.511432] lock_acquire+0xa2/0x360
[30.511917] down_read_nested+0x43/0x150
[30.512383] __btrfs_tree_read_lock+0x39/0x180
[30.512947] __btrfs_read_lock_root_node+0x3a/0x50
[30.513455] btrfs_search_slot+0x464/0x9b0
[30.513947] search_free_space_info+0x45/0x90
[30.514465] __add_to_free_space_tree+0x92/0x39d
[30.515010] btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
[30.515639] btrfs_mount_rw+0x15d/0x20f
[30.516142] btrfs_remount+0x356/0x433
[30.516538] reconfigure_super+0x9f/0x210
[30.517065] path_mount+0x9d1/0xa30
[30.517438] do_mount+0x55/0x70
[30.517824] __x64_sys_mount+0xc4/0xe0
[30.518293] do_syscall_64+0x33/0x40
[30.518776] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[30.519335]
other info that might help us debug this:
[30.520210] Chain exists of:
btrfs-free-space-00 --> btrfs-root-00 --> btrfs-extent-01#2
[30.521407] Possible unsafe locking scenario:
[30.522037]        CPU0                    CPU1
[30.522456]        ----                    ----
[30.522941]   lock(btrfs-extent-01#2);
[30.523311]                                lock(btrfs-root-00);
[30.523952]                                lock(btrfs-extent-01#2);
[30.524620]   lock(btrfs-free-space-00);
[30.525068]
*** DEADLOCK ***
[30.525669] 5 locks held by mount/520:
[30.526116] #0: ffff9babebc520e0 (&type->s_umount_key#37){+.+.}-{3:3}, at: path_mount+0x7ef/0xa30
[30.527056] #1: ffff9babebc52640 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x3d5/0x5c0
[30.527960] #2: ffff9babeae8f2e8 (&cache->free_space_lock#2){+.+.}-{3:3}, at: btrfs_create_free_space_tree.cold.22+0x101/0x45d
[30.529118] #3: ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
[30.530113] #4: ffff9babebd52eb8 (btrfs-extent-00){++++}-{3:3}, at: btrfs_try_tree_read_lock+0x16/0x100
[30.531124]
stack backtrace:
[30.531528] CPU: 0 PID: 520 Comm: mount Not tainted 5.9.0-rc8+ #76
[30.532166] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-4.module_el8.1.0+248+298dec18 04/01/2014
[30.533215] Call Trace:
[30.533452] dump_stack+0x8d/0xc0
[30.533797] check_noncircular+0x13c/0x150
[30.534233] __lock_acquire+0x11ad/0x1b60
[30.534667] lock_acquire+0xa2/0x360
[30.535063] ? __btrfs_tree_read_lock+0x39/0x180
[30.535525] down_read_nested+0x43/0x150
[30.535939] ? __btrfs_tree_read_lock+0x39/0x180
[30.536400] __btrfs_tree_read_lock+0x39/0x180
[30.536862] __btrfs_read_lock_root_node+0x3a/0x50
[30.537304] btrfs_search_slot+0x464/0x9b0
[30.537713] ? trace_hardirqs_on+0x1c/0xf0
[30.538148] search_free_space_info+0x45/0x90
[30.538572] __add_to_free_space_tree+0x92/0x39d
[30.539071] ? printk+0x48/0x4a
[30.539367] btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
[30.539972] btrfs_mount_rw+0x15d/0x20f
[30.540350] btrfs_remount+0x356/0x433
[30.540773] ? shrink_dcache_sb+0xd9/0x100
[30.541203] reconfigure_super+0x9f/0x210
[30.541642] path_mount+0x9d1/0xa30
[30.542040] do_mount+0x55/0x70
[30.542366] __x64_sys_mount+0xc4/0xe0
[30.542822] do_syscall_64+0x33/0x40
[30.543197] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[30.543691] RIP: 0033:0x7f109f7ab93a
[30.546042] RSP: 002b:00007ffc47c4f858 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[30.546770] RAX: ffffffffffffffda RBX: 00007f109f8cf264 RCX: 00007f109f7ab93a
[30.547485] RDX: 0000557e6fc10770 RSI: 0000557e6fc19cf0 RDI: 0000557e6fc19cd0
[30.548185] RBP: 0000557e6fc10520 R08: 0000557e6fc18e30 R09: 0000557e6fc18cb0
[30.548911] R10: 0000000000200020 R11: 0000000000000246 R12: 0000000000000000
[30.549606] R13: 0000557e6fc19cd0 R14: 0000557e6fc10770 R15: 0000557e6fc10520
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-18 15:06:27 -08:00
btrfs_tree_unlock ( leaf ) ;
2011-09-13 12:44:20 +02:00
key . objectid = objectid ;
key . type = BTRFS_ROOT_ITEM_KEY ;
key . offset = 0 ;
ret = btrfs_insert_root ( trans , tree_root , & key , & root - > root_item ) ;
if ( ret )
goto fail ;
2013-03-21 04:32:32 +00:00
return root ;
2020-11-18 15:06:27 -08:00
fail :
2020-01-24 09:33:01 -05:00
btrfs_put_root ( root ) ;
2011-09-13 12:44:20 +02:00
2013-03-21 04:32:32 +00:00
return ERR_PTR ( ret ) ;
2011-09-13 12:44:20 +02:00
}
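The lock dependency loop described in the lockdep commit message for btrfs_create_tree() can be illustrated with a small cycle check. This is only a sketch, not how lockdep is implemented: the enum, struct demo_lock_graph and the demo_* functions are hypothetical, and the graph has just the three lock classes named in that message. An edge from A to B means "B was acquired while A was held".

```c
#include <assert.h>
#include <stdbool.h>

enum demo_lock { DEMO_ROOT, DEMO_EXTENT, DEMO_FREE_SPACE, DEMO_NR_LOCKS };

struct demo_lock_graph {
	/* edge[a][b] is true when lock b was acquired while lock a was held. */
	bool edge[DEMO_NR_LOCKS][DEMO_NR_LOCKS];
};

static void demo_add_dependency(struct demo_lock_graph *g,
				enum demo_lock held, enum demo_lock acquired)
{
	assert(held < DEMO_NR_LOCKS && acquired < DEMO_NR_LOCKS);
	g->edge[held][acquired] = true;
}

/* Depth-first search; state 1 = on the current path, 2 = fully explored. */
static bool demo_visit(const struct demo_lock_graph *g, int node, int *state)
{
	state[node] = 1;
	for (int next = 0; next < DEMO_NR_LOCKS; next++) {
		if (!g->edge[node][next])
			continue;
		if (state[next] == 1)
			return true;
		if (state[next] == 0 && demo_visit(g, next, state))
			return true;
	}
	state[node] = 2;
	return false;
}

/* A cycle in the dependency graph means a possible deadlock. */
static bool demo_has_cycle(const struct demo_lock_graph *g)
{
	int state[DEMO_NR_LOCKS] = { 0 };

	for (int node = 0; node < DEMO_NR_LOCKS; node++)
		if (state[node] == 0 && demo_visit(g, node, state))
			return true;
	return false;
}
```

Recording free space -> root, root -> extent and extent -> free space produces a cycle; dropping the free space -> root edge, which corresponds to unlocking the new leaf before inserting the root item, removes it.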
2009-01-21 12:54:03 -05:00
static struct btrfs_root * alloc_log_tree ( struct btrfs_trans_handle * trans ,
struct btrfs_fs_info * fs_info )
2007-04-09 10:42:37 -04:00
{
struct btrfs_root * root ;
2008-09-05 16:13:11 -04:00
2020-01-24 09:32:18 -05:00
root = btrfs_alloc_root ( fs_info , BTRFS_TREE_LOG_OBJECTID , GFP_NOFS ) ;
2008-09-05 16:13:11 -04:00
if ( ! root )
2009-01-21 12:54:03 -05:00
return ERR_PTR ( - ENOMEM ) ;
2008-09-05 16:13:11 -04:00
root - > root_key . objectid = BTRFS_TREE_LOG_OBJECTID ;
root - > root_key . type = BTRFS_ROOT_ITEM_KEY ;
root - > root_key . offset = BTRFS_TREE_LOG_OBJECTID ;
2014-04-02 19:51:05 +08:00
2021-02-04 19:22:17 +09:00
return root ;
}
int btrfs_alloc_log_tree_node ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root )
{
struct extent_buffer * leaf ;
2009-01-21 12:54:03 -05:00
/*
2020-05-15 14:01:40 +08:00
* DON'T set SHAREABLE bit for log trees.
2014-04-02 19:51:05 +08:00
*
2020-05-15 14:01:40 +08:00
* Log trees are not exposed to user space thus can't be snapshotted,
* and they go away before a real commit is actually done.
*
* They do store pointers to file data extents, and those reference
* counts still get updated (along with back refs to the log tree).
2009-01-21 12:54:03 -05:00
*/
2008-09-05 16:13:11 -04:00
2014-06-15 01:54:12 +02:00
leaf = btrfs_alloc_tree_block ( trans , root , 0 , BTRFS_TREE_LOG_OBJECTID ,
2023-06-21 17:37:23 -07:00
NULL , 0 , 0 , 0 , 0 , BTRFS_NESTING_NORMAL ) ;
2021-02-04 19:22:17 +09:00
if ( IS_ERR ( leaf ) )
return PTR_ERR ( leaf ) ;
2008-09-05 16:13:11 -04:00
2009-01-21 12:54:03 -05:00
root - > node = leaf ;
2008-09-05 16:13:11 -04:00
2023-09-12 13:04:29 +01:00
btrfs_mark_buffer_dirty ( trans , root - > node ) ;
2008-09-05 16:13:11 -04:00
btrfs_tree_unlock ( root - > node ) ;
2021-02-04 19:22:17 +09:00
return 0 ;
2009-01-21 12:54:03 -05:00
}
int btrfs_init_log_root_tree ( struct btrfs_trans_handle * trans ,
struct btrfs_fs_info * fs_info )
{
struct btrfs_root * log_root ;
log_root = alloc_log_tree ( trans , fs_info ) ;
if ( IS_ERR ( log_root ) )
return PTR_ERR ( log_root ) ;
2021-02-04 19:22:17 +09:00
2021-02-04 19:22:20 +09:00
if ( ! btrfs_is_zoned ( fs_info ) ) {
int ret = btrfs_alloc_log_tree_node ( trans , log_root ) ;
if ( ret ) {
btrfs_put_root ( log_root ) ;
return ret ;
}
2021-02-04 19:22:17 +09:00
}
2009-01-21 12:54:03 -05:00
WARN_ON ( fs_info - > log_root_tree ) ;
fs_info - > log_root_tree = log_root ;
return 0 ;
}
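The functions above return either a valid pointer or an errno value encoded in the pointer, and callers test it with IS_ERR() and decode it with PTR_ERR(). A simplified user-space model of that convention might look like the following. The demo_* helpers are hypothetical re-implementations for illustration, not the kernel macros themselves; the sketch assumes, as the kernel does, that the top 4095 addresses are never valid pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Largest errno value that is encoded in a pointer (assumed, as in the kernel). */
#define DEMO_MAX_ERRNO 4095

static inline void *demo_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline long demo_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error pointers occupy the last DEMO_MAX_ERRNO values of the address space. */
static inline bool demo_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_log_root {
	int id;
};

/* Hypothetical allocator: fails with -ENOMEM on request or on real OOM. */
static struct demo_log_root *demo_alloc_log_tree(bool simulate_oom)
{
	struct demo_log_root *root;

	if (simulate_oom)
		return demo_err_ptr(-ENOMEM);
	root = calloc(1, sizeof(*root));
	if (!root)
		return demo_err_ptr(-ENOMEM);
	return root;
}

/* Caller pattern: propagate the encoded errno, otherwise publish the root. */
static int demo_init_log_root(struct demo_log_root **slot, bool simulate_oom)
{
	struct demo_log_root *root = demo_alloc_log_tree(simulate_oom);

	assert(slot);
	if (demo_is_err(root))
		return (int)demo_ptr_err(root);
	*slot = root;
	return 0;
}
```

The benefit of the convention is that one return value carries both the object and the failure reason, so callers such as btrfs_init_log_root_tree() need no separate out-parameter for the error code.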
int btrfs_add_log_tree ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root )
{
2016-06-22 18:54:23 -04:00
struct btrfs_fs_info * fs_info = root - > fs_info ;
2009-01-21 12:54:03 -05:00
struct btrfs_root * log_root ;
struct btrfs_inode_item * inode_item ;
2021-02-04 19:22:17 +09:00
int ret ;
2009-01-21 12:54:03 -05:00
2016-06-22 18:54:23 -04:00
log_root = alloc_log_tree ( trans , fs_info ) ;
2009-01-21 12:54:03 -05:00
if ( IS_ERR ( log_root ) )
return PTR_ERR ( log_root ) ;
2021-02-04 19:22:17 +09:00
ret = btrfs_alloc_log_tree_node ( trans , log_root ) ;
if ( ret ) {
btrfs_put_root ( log_root ) ;
return ret ;
}
2024-07-01 10:51:28 +01:00
btrfs_set_root_last_trans ( log_root , trans - > transid ) ;
2024-04-15 16:16:23 -04:00
log_root - > root_key . offset = btrfs_root_id ( root ) ;
2009-01-21 12:54:03 -05:00
inode_item = & log_root - > root_item . inode ;
2013-07-16 11:19:18 +08:00
btrfs_set_stack_inode_generation ( inode_item , 1 ) ;
btrfs_set_stack_inode_size ( inode_item , 3 ) ;
btrfs_set_stack_inode_nlink ( inode_item , 1 ) ;
2016-06-15 09:22:56 -04:00
btrfs_set_stack_inode_nbytes ( inode_item ,
2016-06-22 18:54:23 -04:00
fs_info - > nodesize ) ;
2013-07-16 11:19:18 +08:00
btrfs_set_stack_inode_mode ( inode_item , S_IFDIR | 0755 ) ;
2009-01-21 12:54:03 -05:00
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
btrfs_set_root_node ( & log_root - > root_item , log_root - > node ) ;
2009-01-21 12:54:03 -05:00
WARN_ON ( root - > log_root ) ;
root - > log_root = log_root ;
2023-10-04 11:38:49 +01:00
btrfs_set_root_log_transid ( root , 0 ) ;
2014-02-20 18:08:59 +08:00
root - > log_transid_committed = - 1 ;
2023-10-04 11:38:48 +01:00
btrfs_set_root_last_log_commit ( root , 0 ) ;
2008-09-05 16:13:11 -04:00
return 0 ;
}
2020-10-19 16:02:31 -04:00
static struct btrfs_root * read_tree_root_path ( struct btrfs_root * tree_root ,
struct btrfs_path * path ,
2024-05-30 19:14:12 +02:00
const struct btrfs_key * key )
2008-09-05 16:13:11 -04:00
{
struct btrfs_root * root ;
2022-09-14 13:32:50 +08:00
struct btrfs_tree_parent_check check = { 0 } ;
2008-09-05 16:13:11 -04:00
struct btrfs_fs_info * fs_info = tree_root - > fs_info ;
2008-10-29 14:49:05 -04:00
u64 generation ;
2013-05-15 07:48:19 +00:00
int ret ;
2018-03-29 09:08:11 +08:00
int level ;
2007-04-09 10:42:37 -04:00
2020-01-24 09:32:18 -05:00
root = btrfs_alloc_root ( fs_info , key - > objectid , GFP_NOFS ) ;
2020-10-19 16:02:31 -04:00
if ( ! root )
return ERR_PTR ( - ENOMEM ) ;
2007-04-09 10:42:37 -04:00
2013-05-15 07:48:19 +00:00
ret = btrfs_find_root ( tree_root , key , path ,
& root - > root_item , & root - > root_key ) ;
2007-04-09 10:42:37 -04:00
if ( ret ) {
2009-09-21 15:56:00 -04:00
if ( ret > 0 )
ret = - ENOENT ;
2020-10-19 16:02:31 -04:00
goto fail ;
2007-04-09 10:42:37 -04:00
}
2009-09-21 15:56:00 -04:00
2008-10-29 14:49:05 -04:00
generation = btrfs_root_generation ( & root - > root_item ) ;
2018-03-29 09:08:11 +08:00
level = btrfs_root_level ( & root - > root_item ) ;
2022-09-14 13:32:50 +08:00
check . level = level ;
check . transid = generation ;
check . owner_root = key - > objectid ;
root - > node = read_tree_block ( fs_info , btrfs_root_bytenr ( & root - > root_item ) ,
& check ) ;
2015-05-25 17:30:15 +08:00
if ( IS_ERR ( root - > node ) ) {
ret = PTR_ERR ( root - > node ) ;
2020-02-14 16:11:42 -05:00
root - > node = NULL ;
2020-10-19 16:02:31 -04:00
goto fail ;
2022-02-22 15:41:19 +08:00
}
if ( ! btrfs_buffer_uptodate ( root - > node , generation , 0 ) ) {
2013-05-15 07:48:19 +00:00
ret = - EIO ;
2020-10-19 16:02:31 -04:00
goto fail ;
2013-04-23 14:17:42 -04:00
}
btrfs: tree-checker: check extent buffer owner against owner rootid
Btrfs doesn't check whether the tree block respects the root owner.
This means that if a tree block is referred to by a parent in the extent
tree but has an owner of 5, btrfs can still continue reading the tree
block, as long as it doesn't trigger other sanity checks.
Normally this is fine, but combined with the empty tree check in
check_leaf(), if we hit an empty extent tree whose root node has the
csum tree as owner, we can let such an extent buffer sneak in.
Shrink the hole by:
- Do extra eb owner check at tree read time
- Make sure the root owner extent buffer exactly matches the root id.
Unfortunately, we can't yet completely patch the hole, as several
call sites can't pass all the info we need:
- For reloc/log trees
Their owner is key::offset, not key::objectid.
We need the full root key to do that accurate check.
For now, we just skip the ownership check for those trees.
- For add_data_references() of relocation
That call site doesn't have any parent/ownership info, as the
bytenrs all come from btrfs_find_all_leafs().
- For direct backref items walk
Direct backref items record the parent bytenr directly, thus unlike
indirect backref items, we don't do a full tree search.
Thus in that case, we don't have the full parent owner to check.
For the latter two cases, they all pass 0 as @owner_root, thus we can
skip those cases if @owner_root is 0.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-16 08:05:58 +08:00
/*
* For real fs , and not log / reloc trees , root owner must
* match its root node owner
*/
2024-04-18 00:47:13 +02:00
if ( ! btrfs_is_testing ( fs_info ) & &
2024-04-15 16:16:23 -04:00
btrfs_root_id ( root ) ! = BTRFS_TREE_LOG_OBJECTID & &
btrfs_root_id ( root ) ! = BTRFS_TREE_RELOC_OBJECTID & &
btrfs_root_id ( root ) ! = btrfs_header_owner ( root - > node ) ) {
2022-03-16 08:05:58 +08:00
btrfs_crit ( fs_info ,
" root=%llu block=%llu, tree root owner mismatch, have %llu expect %llu " ,
2024-04-15 16:16:23 -04:00
btrfs_root_id ( root ) , root - > node - > start ,
2022-03-16 08:05:58 +08:00
btrfs_header_owner ( root - > node ) ,
2024-04-15 16:16:23 -04:00
btrfs_root_id ( root ) ) ;
2022-03-16 08:05:58 +08:00
ret = - EUCLEAN ;
goto fail ;
}
2009-06-10 10:45:14 -04:00
root - > commit_root = btrfs_root_node ( root ) ;
2013-05-15 07:48:19 +00:00
return root ;
2020-10-19 16:02:31 -04:00
fail :
2020-01-24 09:33:01 -05:00
btrfs_put_root ( root ) ;
2020-10-19 16:02:31 -04:00
return ERR_PTR ( ret ) ;
}
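The owner check at the end of read_tree_root_path() above can be sketched in user space as follows. This is a minimal illustration, not the kernel implementation: the `sketch_` names are hypothetical, the two object ids are stand-ins assumed to mirror the on-disk values (-6 and -8 as unsigned 64-bit numbers), and the `EUCLEAN` fallback is only there for platforms whose `<errno.h>` lacks it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* fallback for non-Linux hosts; assumption */
#endif

/* Hypothetical stand-ins for the log and reloc tree object ids. */
#define SKETCH_TREE_LOG_OBJECTID	((uint64_t)-6)
#define SKETCH_TREE_RELOC_OBJECTID	((uint64_t)-8)

/*
 * Return 0 when the root id and the owner recorded in the root node's
 * header agree, or when the check is skipped (test filesystems, log and
 * reloc trees). Return -EUCLEAN on a mismatch.
 */
static int sketch_check_root_owner(uint64_t root_id, uint64_t header_owner,
				   bool is_testing)
{
	if (is_testing)
		return 0;
	/* Log and reloc trees keep their owner in key.offset, so skip them. */
	if (root_id == SKETCH_TREE_LOG_OBJECTID ||
	    root_id == SKETCH_TREE_RELOC_OBJECTID)
		return 0;
	if (root_id != header_owner)
		return -EUCLEAN;
	return 0;
}
```

A caller would treat a nonzero return the same way the function above treats its `fail` label: drop the reference on the half-built root and propagate the error.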
struct btrfs_root * btrfs_read_tree_root ( struct btrfs_root * tree_root ,
2024-05-30 19:14:12 +02:00
const struct btrfs_key * key )
2020-10-19 16:02:31 -04:00
{
struct btrfs_root * root ;
struct btrfs_path * path ;
path = btrfs_alloc_path ( ) ;
if ( ! path )
return ERR_PTR ( - ENOMEM ) ;
root = read_tree_root_path ( tree_root , path , key ) ;
btrfs_free_path ( path ) ;
return root ;
2013-05-15 07:48:19 +00:00
}
2020-06-16 10:17:36 +08:00
/*
* Initialize subvolume root in - memory structure
*
* @ anon_dev : anonymous device to attach to the root , if zero , allocate new
*/
static int btrfs_init_fs_root ( struct btrfs_root * root , dev_t anon_dev )
2013-05-15 07:48:19 +00:00
{
int ret ;
2023-03-01 21:47:08 +01:00
btrfs_drew_lock_init ( & root - > snapshot_lock ) ;
2014-03-06 13:38:19 +08:00
2024-04-15 16:16:23 -04:00
if ( btrfs_root_id ( root ) ! = BTRFS_TREE_LOG_OBJECTID & &
btrfs: reject invalid reloc tree root keys with stack dump
[BUG]
Syzbot reported a crash that an ASSERT() got triggered inside
prepare_to_merge().
That ASSERT() makes sure the reloc tree is properly pointed back by its
subvolume tree.
[CAUSE]
After more debugging output, it turns out we had an invalid reloc tree:
BTRFS error (device loop1): reloc tree mismatch, root 8 has no reloc root, expect reloc root key (-8, 132, 8) gen 17
Note the above root key is (TREE_RELOC_OBJECTID, ROOT_ITEM,
QUOTA_TREE_OBJECTID), meaning it's a reloc tree for quota tree.
But reloc trees can only exist for subvolumes, as for non-subvolume
trees, we just COW the involved tree block, no need to create a reloc
tree since those tree blocks won't be shared with other trees.
Only subvolumes tree can share tree blocks with other trees (thus they
have BTRFS_ROOT_SHAREABLE flag).
Thus this new debug output proves my previous assumption that corrupted
on-disk data can trigger that ASSERT().
[FIX]
Besides the dedicated fix and the graceful exit, also let tree-checker to
check such root keys, to make sure reloc trees can only exist for subvolumes.
CC: stable@vger.kernel.org # 5.15+
Reported-by: syzbot+ae97a827ae1c3336bbb4@syzkaller.appspotmail.com
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-03 17:20:43 +08:00
! btrfs_is_data_reloc_root ( root ) & &
2024-04-15 16:16:23 -04:00
is_fstree ( btrfs_root_id ( root ) ) ) {
2020-05-15 14:01:40 +08:00
set_bit ( BTRFS_ROOT_SHAREABLE , & root - > state ) ;
2020-01-24 09:32:19 -05:00
btrfs_check_and_init_root_item ( & root - > root_item ) ;
}
2020-06-16 10:17:34 +08:00
/*
* Don ' t assign anonymous block device to roots that are not exposed to
* userspace , the id pool is limited to 1 M
*/
2024-04-15 16:16:23 -04:00
if ( is_fstree ( btrfs_root_id ( root ) ) & &
2020-06-16 10:17:34 +08:00
btrfs_root_refs ( & root - > root_item ) > 0 ) {
2020-06-16 10:17:36 +08:00
if ( ! anon_dev ) {
ret = get_anon_bdev ( & root - > anon_dev ) ;
if ( ret )
goto fail ;
} else {
root - > anon_dev = anon_dev ;
}
2020-06-16 10:17:34 +08:00
}
2016-01-07 18:56:59 +05:30
mutex_lock ( & root - > objectid_mutex ) ;
2020-12-07 17:32:32 +02:00
ret = btrfs_init_root_free_objectid ( root ) ;
2016-01-07 18:56:59 +05:30
if ( ret ) {
mutex_unlock ( & root - > objectid_mutex ) ;
2016-06-28 13:44:38 -07:00
goto fail ;
2016-01-07 18:56:59 +05:30
}
2020-12-07 17:32:35 +02:00
ASSERT ( root - > free_objectid < = BTRFS_LAST_FREE_OBJECTID ) ;
2016-01-07 18:56:59 +05:30
mutex_unlock ( & root - > objectid_mutex ) ;
2013-05-15 07:48:19 +00:00
return 0 ;
fail :
2018-07-20 16:30:25 +02:00
/* The caller is responsible for calling btrfs_free_fs_root */
2013-05-15 07:48:19 +00:00
return ret ;
}
2020-01-24 09:32:25 -05:00
static struct btrfs_root * btrfs_lookup_fs_root ( struct btrfs_fs_info * fs_info ,
u64 root_id )
2013-05-15 07:48:19 +00:00
{
struct btrfs_root * root ;
2022-07-15 13:59:21 +02:00
spin_lock ( & fs_info - > fs_roots_radix_lock ) ;
root = radix_tree_lookup ( & fs_info - > fs_roots_radix ,
( unsigned long ) root_id ) ;
2023-05-23 10:40:20 +02:00
root = btrfs_grab_root ( root ) ;
2022-07-15 13:59:21 +02:00
spin_unlock ( & fs_info - > fs_roots_radix_lock ) ;
2013-05-15 07:48:19 +00:00
return root ;
}
2020-10-19 16:02:31 -04:00
static struct btrfs_root * btrfs_get_global_root ( struct btrfs_fs_info * fs_info ,
u64 objectid )
{
2021-11-05 16:45:51 -04:00
struct btrfs_key key = {
. objectid = objectid ,
. type = BTRFS_ROOT_ITEM_KEY ,
. offset = 0 ,
} ;
2023-05-23 10:40:19 +02:00
switch ( objectid ) {
case BTRFS_ROOT_TREE_OBJECTID :
2020-10-19 16:02:31 -04:00
return btrfs_grab_root ( fs_info - > tree_root ) ;
2023-05-23 10:40:19 +02:00
case BTRFS_EXTENT_TREE_OBJECTID :
2021-11-05 16:45:51 -04:00
return btrfs_grab_root ( btrfs_global_root ( fs_info , & key ) ) ;
2023-05-23 10:40:19 +02:00
case BTRFS_CHUNK_TREE_OBJECTID :
2020-10-19 16:02:31 -04:00
return btrfs_grab_root ( fs_info - > chunk_root ) ;
2023-05-23 10:40:19 +02:00
case BTRFS_DEV_TREE_OBJECTID :
2020-10-19 16:02:31 -04:00
return btrfs_grab_root ( fs_info - > dev_root ) ;
2023-05-23 10:40:19 +02:00
case BTRFS_CSUM_TREE_OBJECTID :
2021-11-05 16:45:51 -04:00
return btrfs_grab_root ( btrfs_global_root ( fs_info , & key ) ) ;
2023-05-23 10:40:19 +02:00
case BTRFS_QUOTA_TREE_OBJECTID :
2023-05-23 10:40:18 +02:00
return btrfs_grab_root ( fs_info - > quota_root ) ;
2023-05-23 10:40:19 +02:00
case BTRFS_UUID_TREE_OBJECTID :
2023-05-23 10:40:18 +02:00
return btrfs_grab_root ( fs_info - > uuid_root ) ;
2023-05-23 10:40:19 +02:00
case BTRFS_BLOCK_GROUP_TREE_OBJECTID :
2023-05-23 10:40:18 +02:00
return btrfs_grab_root ( fs_info - > block_group_root ) ;
2023-05-23 10:40:19 +02:00
case BTRFS_FREE_SPACE_TREE_OBJECTID :
2023-05-23 10:40:18 +02:00
return btrfs_grab_root ( btrfs_global_root ( fs_info , & key ) ) ;
2023-09-14 09:06:57 -07:00
case BTRFS_RAID_STRIPE_TREE_OBJECTID :
return btrfs_grab_root ( fs_info - > stripe_root ) ;
2023-05-23 10:40:19 +02:00
default :
return NULL ;
}
2020-10-19 16:02:31 -04:00
}
2013-05-15 07:48:19 +00:00
int btrfs_insert_fs_root ( struct btrfs_fs_info * fs_info ,
struct btrfs_root * root )
{
int ret ;
2022-07-15 13:59:21 +02:00
ret = radix_tree_preload ( GFP_NOFS ) ;
if ( ret )
return ret ;
spin_lock ( & fs_info - > fs_roots_radix_lock ) ;
ret = radix_tree_insert ( & fs_info - > fs_roots_radix ,
2024-04-15 16:16:23 -04:00
( unsigned long ) btrfs_root_id ( root ) ,
2022-07-15 13:59:21 +02:00
root ) ;
2020-01-24 09:32:27 -05:00
if ( ret = = 0 ) {
2020-01-24 09:33:01 -05:00
btrfs_grab_root ( root ) ;
2022-07-15 13:59:21 +02:00
set_bit ( BTRFS_ROOT_IN_RADIX , & root - > state ) ;
2020-01-24 09:32:27 -05:00
}
2022-07-15 13:59:21 +02:00
spin_unlock ( & fs_info - > fs_roots_radix_lock ) ;
radix_tree_preload_end ( ) ;
2013-05-15 07:48:19 +00:00
return ret ;
}
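The reference-counting convention used by btrfs_lookup_fs_root() and btrfs_insert_fs_root() above can be sketched in user space like this. It is a simplified model under stated assumptions: a small fixed array stands in for the radix tree, locking is omitted entirely, and every `sketch_` name is hypothetical. The point it illustrates is that the table holds its own reference on each inserted root, and every successful lookup hands the caller an extra reference to drop later.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ROOTS 4

struct sketch_root {
	uint64_t id;
	int refs;
};

/* Stand-in for the radix tree; no locking in this sketch. */
static struct sketch_root *sketch_table[SKETCH_MAX_ROOTS];

/* Take a reference unless the root is NULL or already dead (refs == 0). */
static struct sketch_root *sketch_grab_root(struct sketch_root *root)
{
	if (!root || root->refs == 0)
		return NULL;
	root->refs++;
	return root;
}

/* Insert a root; the table takes a reference only when the insert succeeds. */
static int sketch_insert_root(struct sketch_root *root)
{
	size_t i, free_slot = SKETCH_MAX_ROOTS;

	for (i = 0; i < SKETCH_MAX_ROOTS; i++) {
		if (sketch_table[i] && sketch_table[i]->id == root->id)
			return -EEXIST;
		if (!sketch_table[i] && free_slot == SKETCH_MAX_ROOTS)
			free_slot = i;
	}
	if (free_slot == SKETCH_MAX_ROOTS)
		return -ENOMEM;
	sketch_table[free_slot] = root;
	sketch_grab_root(root);
	return 0;
}

/* Look up a root by id and return it with an extra reference, or NULL. */
static struct sketch_root *sketch_lookup_root(uint64_t id)
{
	size_t i;

	for (i = 0; i < SKETCH_MAX_ROOTS; i++) {
		if (sketch_table[i] && sketch_table[i]->id == id)
			return sketch_grab_root(sketch_table[i]);
	}
	return NULL;
}
```

In the real code both operations run under `fs_roots_radix_lock`, which is what makes the lookup-then-grab sequence safe against a concurrent removal; the sketch leaves that out to keep the reference accounting visible.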
2024-05-30 19:14:12 +02:00
void btrfs_check_leaked_roots ( const struct btrfs_fs_info * fs_info )
2020-01-24 09:33:00 -05:00
{
# ifdef CONFIG_BTRFS_DEBUG
struct btrfs_root * root ;
while ( ! list_empty ( & fs_info - > allocated_roots ) ) {
2020-09-03 14:29:51 -04:00
char buf [ BTRFS_ROOT_NAME_BUF_LEN ] ;
2020-01-24 09:33:00 -05:00
root = list_first_entry ( & fs_info - > allocated_roots ,
struct btrfs_root , leak_list ) ;
2020-09-03 14:29:51 -04:00
btrfs_err ( fs_info , " leaked root %s refcount %d " ,
2020-12-16 11:18:44 -05:00
btrfs_root_name ( & root - > root_key , buf ) ,
2020-01-24 09:33:00 -05:00
refcount_read ( & root - > refs ) ) ;
2024-01-02 15:18:07 -05:00
WARN_ON_ONCE ( 1 ) ;
2020-01-24 09:33:00 -05:00
while ( refcount_read ( & root - > refs ) > 1 )
2020-01-24 09:33:01 -05:00
btrfs_put_root ( root ) ;
btrfs_put_root ( root ) ;
2020-01-24 09:33:00 -05:00
}
# endif
}
2021-11-05 16:45:51 -04:00
static void free_global_roots ( struct btrfs_fs_info * fs_info )
{
struct btrfs_root * root ;
struct rb_node * node ;
while ( ( node = rb_first_postorder ( & fs_info - > global_root_tree ) ) ! = NULL ) {
root = rb_entry ( node , struct btrfs_root , rb_node ) ;
rb_erase ( & root - > rb_node , & fs_info - > global_root_tree ) ;
btrfs_put_root ( root ) ;
}
}
2020-01-24 09:32:53 -05:00
void btrfs_free_fs_info ( struct btrfs_fs_info * fs_info )
{
2024-03-22 18:02:59 +00:00
struct percpu_counter * em_counter = & fs_info - > evictable_extent_maps ;
2020-01-24 09:32:57 -05:00
percpu_counter_destroy ( & fs_info - > dirty_metadata_bytes ) ;
percpu_counter_destroy ( & fs_info - > delalloc_bytes ) ;
2020-10-09 09:28:20 -04:00
percpu_counter_destroy ( & fs_info - > ordered_bytes ) ;
2024-03-22 18:02:59 +00:00
if ( percpu_counter_initialized ( em_counter ) )
ASSERT ( percpu_counter_sum_positive ( em_counter ) = = 0 ) ;
percpu_counter_destroy ( em_counter ) ;
2020-01-24 09:32:57 -05:00
percpu_counter_destroy ( & fs_info - > dev_replace . bio_counter ) ;
btrfs_free_csum_hash ( fs_info ) ;
btrfs_free_stripe_hash_table ( fs_info ) ;
btrfs_free_ref_cache ( fs_info ) ;
2020-01-24 09:32:53 -05:00
kfree ( fs_info - > balance_ctl ) ;
kfree ( fs_info - > delayed_root ) ;
2021-11-05 16:45:51 -04:00
free_global_roots ( fs_info ) ;
2020-01-24 09:33:01 -05:00
btrfs_put_root ( fs_info - > tree_root ) ;
btrfs_put_root ( fs_info - > chunk_root ) ;
btrfs_put_root ( fs_info - > dev_root ) ;
btrfs_put_root ( fs_info - > quota_root ) ;
btrfs_put_root ( fs_info - > uuid_root ) ;
btrfs_put_root ( fs_info - > fs_root ) ;
2020-05-15 14:01:42 +08:00
btrfs_put_root ( fs_info - > data_reloc_root ) ;
2021-12-15 15:40:07 -05:00
btrfs_put_root ( fs_info - > block_group_root ) ;
2023-09-14 09:06:57 -07:00
btrfs_put_root ( fs_info - > stripe_root ) ;
2020-01-24 09:33:00 -05:00
btrfs_check_leaked_roots ( fs_info ) ;
2020-02-14 16:11:40 -05:00
btrfs_extent_buffer_leak_debug_check ( fs_info ) ;
2020-01-24 09:32:53 -05:00
kfree ( fs_info - > super_copy ) ;
kfree ( fs_info - > super_for_commit ) ;
2021-08-17 17:38:51 +08:00
kfree ( fs_info - > subpage_info ) ;
2020-01-24 09:32:53 -05:00
kvfree ( fs_info ) ;
}
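The anonymous-device double free discussed in the commit message attached to btrfs_get_root_ref() comes down to an ownership hand-off: when the callee frees a preallocated id, the caller must be told so that its error path does not free it again. The following user-space sketch illustrates the pointer-reset approach that message describes. All `sketch_` names, the tiny id pool, and the bad-free counter are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_IDS 8

static bool sketch_id_used[SKETCH_MAX_IDS];	/* index 0 is never handed out */
static int sketch_bad_frees;			/* counts frees of unallocated ids */

/* Allocate the lowest free id, or return 0 when the pool is exhausted. */
static unsigned int sketch_alloc_id(void)
{
	unsigned int i;

	for (i = 1; i < SKETCH_MAX_IDS; i++) {
		if (!sketch_id_used[i]) {
			sketch_id_used[i] = true;
			return i;
		}
	}
	return 0;
}

/* Free an id; 0 means "nothing to free". Record frees of unallocated ids. */
static void sketch_free_id(unsigned int id)
{
	if (id == 0)
		return;
	if (id >= SKETCH_MAX_IDS || !sketch_id_used[id]) {
		sketch_bad_frees++;
		return;
	}
	sketch_id_used[id] = false;
}

/*
 * Callee: if the root was already cached, the preallocated id is not
 * needed. Free it and reset the caller's copy to 0 so the caller cannot
 * free it a second time.
 */
static void sketch_get_root(unsigned int *anon_dev, bool already_cached)
{
	if (already_cached && anon_dev && *anon_dev) {
		sketch_free_id(*anon_dev);
		*anon_dev = 0;
	}
}

/* Caller: preallocate, hand off by pointer, clean up on failure. */
static int sketch_create_snapshot(bool already_cached, bool commit_fails)
{
	unsigned int anon_dev = sketch_alloc_id();

	sketch_get_root(&anon_dev, already_cached);
	if (commit_fails) {
		sketch_free_id(anon_dev);	/* no-op if the callee reset it */
		return -1;
	}
	return 0;
}
```

Had `sketch_get_root()` taken the id by value, the caller's copy would still hold the stale number after the callee freed it, and the error path would free an id that may already belong to someone else.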
2020-06-16 10:17:36 +08:00
/*
* Get an in - memory reference of a root structure .
*
* For essential trees like root / extent tree , we grab it from fs_info directly .
* For subvolume trees , we check the cached filesystem roots first . If not
* found , then read it from disk and add it to cached fs roots .
*
* Caller should release the root by calling btrfs_put_root ( ) after the usage .
*
* NOTE : Reloc and log trees can ' t be read by this function as they share the
* same root objectid .
*
* @ objectid : root id
* @ anon_dev : preallocated anonymous block device number for new roots ,
btrfs: fix double free of anonymous device after snapshot creation failure
When creating a snapshot we may do a double free of an anonymous device
in case there's an error committing the transaction. The second free may
result in freeing an anonymous device number that was allocated by some
other subsystem in the kernel or another btrfs filesystem.
The steps that lead to this:
1) At ioctl.c:create_snapshot() we allocate an anonymous device number
and assign it to pending_snapshot->anon_dev;
2) Then we call btrfs_commit_transaction() and end up at
transaction.c:create_pending_snapshot();
3) There we call btrfs_get_new_fs_root() and pass it the anonymous device
number stored in pending_snapshot->anon_dev;
4) btrfs_get_new_fs_root() frees that anonymous device number because
btrfs_lookup_fs_root() returned a root - someone else did a lookup
of the new root already, which could be some task doing backref walking;
5) After that some error happens in the transaction commit path, and at
ioctl.c:create_snapshot() we jump to the 'fail' label, and after
that we free again the same anonymous device number, which in the
meanwhile may have been reallocated somewhere else, because
pending_snapshot->anon_dev still has the same value as in step 1.
Recently syzbot ran into this and reported the following trace:
------------[ cut here ]------------
ida_free called for id=51 which is not allocated.
WARNING: CPU: 1 PID: 31038 at lib/idr.c:525 ida_free+0x370/0x420 lib/idr.c:525
Modules linked in:
CPU: 1 PID: 31038 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00410-gc02197fc9076 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
RIP: 0010:ida_free+0x370/0x420 lib/idr.c:525
Code: 10 42 80 3c 28 (...)
RSP: 0018:ffffc90015a67300 EFLAGS: 00010246
RAX: be5130472f5dd000 RBX: 0000000000000033 RCX: 0000000000040000
RDX: ffffc90009a7a000 RSI: 000000000003ffff RDI: 0000000000040000
RBP: ffffc90015a673f0 R08: ffffffff81577992 R09: 1ffff92002b4cdb4
R10: dffffc0000000000 R11: fffff52002b4cdb5 R12: 0000000000000246
R13: dffffc0000000000 R14: ffffffff8e256b80 R15: 0000000000000246
FS: 00007fca3f4b46c0(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f167a17b978 CR3: 000000001ed26000 CR4: 0000000000350ef0
Call Trace:
<TASK>
btrfs_get_root_ref+0xa48/0xaf0 fs/btrfs/disk-io.c:1346
create_pending_snapshot+0xff2/0x2bc0 fs/btrfs/transaction.c:1837
create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1931
btrfs_commit_transaction+0xf1c/0x3740 fs/btrfs/transaction.c:2404
create_snapshot+0x507/0x880 fs/btrfs/ioctl.c:848
btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:998
btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1044
__btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1306
btrfs_ioctl_snap_create_v2+0x1ca/0x400 fs/btrfs/ioctl.c:1393
btrfs_ioctl+0xa74/0xd40
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:871 [inline]
__se_sys_ioctl+0xfe/0x170 fs/ioctl.c:857
do_syscall_64+0xfb/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77
RIP: 0033:0x7fca3e67dda9
Code: 28 00 00 00 (...)
RSP: 002b:00007fca3f4b40c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fca3e7abf80 RCX: 00007fca3e67dda9
RDX: 00000000200005c0 RSI: 0000000050009417 RDI: 0000000000000003
RBP: 00007fca3e6ca47a R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000000b R14: 00007fca3e7abf80 R15: 00007fff6bf95658
</TASK>
Where we get an explicit message where we attempt to free an anonymous
device number that is not currently allocated. It happens in a different
code path from the example below, at btrfs_get_root_ref(), so this change
may not fix the case triggered by syzbot.
To fix at least the code path from the example above, change
btrfs_get_root_ref() and its callers to receive a dev_t pointer argument
for the anonymous device number, so that in case it frees the number, it
also resets it to 0, so that up in the call chain we don't attempt to do
the double free.
CC: stable@vger.kernel.org # 5.10+
Link: https://lore.kernel.org/linux-btrfs/000000000000f673a1061202f630@google.com/
Fixes: e03ee2fe873e ("btrfs: do not ASSERT() if the newly created subvolume already got read")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-23 16:38:43 +00:00
* pass NULL for a new allocation .
2020-06-16 10:17:36 +08:00
* @ check_ref : whether to check root item references , if true , return - ENOENT
* for orphan roots
*/
static struct btrfs_root * btrfs_get_root_ref ( struct btrfs_fs_info * fs_info ,
2024-02-23 16:38:43 +00:00
		u64 objectid, dev_t *anon_dev,
2020-06-16 10:17:36 +08:00
		bool check_ref)
2007-06-22 14:16:25 -04:00
{
	struct btrfs_root *root;
2015-01-02 18:45:16 +01:00
	struct btrfs_path *path;
2015-01-02 19:36:14 +01:00
	struct btrfs_key key;
2007-06-22 14:16:25 -04:00
	int ret;
2020-10-19 16:02:31 -04:00
	root = btrfs_get_global_root(fs_info, objectid);
	if (root)
		return root;
2023-08-03 17:20:41 +08:00
	/*
	 * If we're called for non-subvolume trees, and above function didn't
	 * find one, do not try to read it from disk.
	 *
	 * This is namely for free-space-tree and quota tree, which can change
	 * at runtime and should only be grabbed from fs_info.
	 */
	if (!is_fstree(objectid) && objectid != BTRFS_DATA_RELOC_TREE_OBJECTID)
		return ERR_PTR(-ENOENT);
2009-09-21 15:56:00 -04:00
again:
2020-05-15 19:35:55 +02:00
	root = btrfs_lookup_fs_root(fs_info, objectid);
2013-08-23 10:34:42 +02:00
	if (root) {
btrfs: do not ASSERT() if the newly created subvolume already got read
[BUG]
There is a syzbot crash, triggered by the ASSERT() during subvolume
creation:
assertion failed: !anon_dev, in fs/btrfs/disk-io.c:1319
------------[ cut here ]------------
kernel BUG at fs/btrfs/disk-io.c:1319!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
RIP: 0010:btrfs_get_root_ref.part.0+0x9aa/0xa60
<TASK>
btrfs_get_new_fs_root+0xd3/0xf0
create_subvol+0xd02/0x1650
btrfs_mksubvol+0xe95/0x12b0
__btrfs_ioctl_snap_create+0x2f9/0x4f0
btrfs_ioctl_snap_create+0x16b/0x200
btrfs_ioctl+0x35f0/0x5cf0
__x64_sys_ioctl+0x19d/0x210
do_syscall_64+0x3f/0xe0
entry_SYSCALL_64_after_hwframe+0x63/0x6b
---[ end trace 0000000000000000 ]---
[CAUSE]
During create_subvol(), after inserting the root item for the newly created
subvolume, we call btrfs_get_new_fs_root() to get the
btrfs_root of that subvolume.
The idea here is that we have preallocated an anonymous device number for
the subvolume, thus we can assign it to the new subvolume.
But there is really nothing preventing things like a backref walk from
reading the new subvolume.
If that happens before we call btrfs_get_new_fs_root(), the subvolume
would be read out, with a new anonymous device number already assigned.
In that case, we would trigger the ASSERT(), as we really expect no one to
read out that subvolume (which is not yet accessible from the fs).
But things like a backref walk can still trigger a read of
the subvolume.
Thus the assumption behind the ASSERT() is not correct in the first place.
[FIX]
Fix it by removing the ASSERT(), and just free the @anon_dev, reset it
to 0, and continue.
If the subvolume tree is read out by something else, it should have
already gotten a new anon_dev assigned, thus we only need to free the
preallocated one.
Reported-by: Chenyuan Yang <chenyuan0y@gmail.com>
Fixes: 2dfb1e43f57d ("btrfs: preallocate anon block device at first phase of snapshot creation")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-20 19:41:28 +10:30
		/*
		 * Some other caller may have read out the newly inserted
		 * subvolume already (for things like backref walk etc). Not
		 * that common but still possible. In that case, we just need
		 * to free the anon_dev.
		 */
2024-02-23 16:38:43 +00:00
		if (unlikely(anon_dev && *anon_dev)) {
			free_anon_bdev(*anon_dev);
			*anon_dev = 0;
2024-01-20 19:41:28 +10:30
		}
2020-01-24 09:32:56 -05:00
		if (check_ref && btrfs_root_refs(&root->root_item) == 0) {
2020-01-24 09:33:01 -05:00
			btrfs_put_root(root);
2013-08-23 10:34:42 +02:00
			return ERR_PTR(-ENOENT);
2020-01-24 09:32:56 -05:00
		}
2007-06-22 14:16:25 -04:00
		return root;
2013-08-23 10:34:42 +02:00
	}
2007-06-22 14:16:25 -04:00
2020-05-15 19:35:55 +02:00
	key.objectid = objectid;
	key.type = BTRFS_ROOT_ITEM_KEY;
	key.offset = (u64)-1;
	root = btrfs_read_tree_root(fs_info->tree_root, &key);
2007-06-22 14:16:25 -04:00
	if (IS_ERR(root))
		return root;
2008-11-17 20:42:26 -05:00
2013-09-25 21:47:44 +08:00
	if (check_ref && btrfs_root_refs(&root->root_item) == 0) {
2013-05-15 07:48:19 +00:00
		ret = -ENOENT;
Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) as the inode number when we create a file, so inode
numbers are not reclaimed when we delete files, and we will run out of inode
numbers as we keep creating and deleting files on 32-bit machines.
This fixes it, and it works similarly to how we cache free space in block
groups.
We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.
Because we are searching the commit root, we have to carefully handle the
cross-transaction case.
The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K of RAM
for extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted at runtime.
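The extent-then-bitmap idea can be sketched in userspace C. This is a deliberately simplified illustration, not the btrfs implementation: it swaps the rb-tree for a small fixed array and uses a single 64-bit bitmap. All names and both limits (SKETCH_MAX_EXTENTS, SKETCH_BITMAP_BITS) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_EXTENTS 4	/* hypothetical extent threshold */
#define SKETCH_BITMAP_BITS 64	/* hypothetical window: numbers 0..63 */

struct free_extent {
	uint64_t start;
	uint64_t len;
};

struct free_ino_cache {
	bool use_bitmap;
	unsigned int nr_extents;
	struct free_extent extents[SKETCH_MAX_EXTENTS];
	uint64_t bitmap;	/* bit n set means number n is free */
};

static void cache_init(struct free_ino_cache *c)
{
	memset(c, 0, sizeof(*c));
}

/* Fold every cached extent into the bitmap and switch modes. */
static void cache_to_bitmap(struct free_ino_cache *c)
{
	for (unsigned int i = 0; i < c->nr_extents; i++)
		for (uint64_t n = 0; n < c->extents[i].len; n++)
			c->bitmap |= UINT64_C(1) << (c->extents[i].start + n);
	c->nr_extents = 0;
	c->use_bitmap = true;
}

/* Record [start, start + len) as free. */
static void cache_add_free(struct free_ino_cache *c, uint64_t start,
			   uint64_t len)
{
	assert(start + len <= SKETCH_BITMAP_BITS);
	/* Too many small extents: fall back to the bitmap. */
	if (!c->use_bitmap && c->nr_extents == SKETCH_MAX_EXTENTS)
		cache_to_bitmap(c);
	if (c->use_bitmap) {
		for (uint64_t n = 0; n < len; n++)
			c->bitmap |= UINT64_C(1) << (start + n);
		return;
	}
	c->extents[c->nr_extents].start = start;
	c->extents[c->nr_extents].len = len;
	c->nr_extents++;
}

/* Returns true and stores a free number in *out, or false if none cached. */
static bool cache_alloc(struct free_ino_cache *c, uint64_t *out)
{
	if (c->use_bitmap) {
		for (unsigned int n = 0; n < SKETCH_BITMAP_BITS; n++) {
			if (c->bitmap & (UINT64_C(1) << n)) {
				c->bitmap &= ~(UINT64_C(1) << n);
				*out = n;
				return true;
			}
		}
		return false;
	}
	if (c->nr_extents == 0)
		return false;
	/* Take the first number of the most recently added extent. */
	struct free_extent *e = &c->extents[c->nr_extents - 1];
	*out = e->start;
	e->start++;
	if (--e->len == 0)
		c->nr_extents--;
	return true;
}
```

Extents are cheap while free numbers come in long runs. Once the cache fragments into many short runs, one bit per number costs less memory than one record per run.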
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-20 10:06:11 +08:00
		goto fail;
2011-06-13 15:18:23 +00:00
	}
2011-04-20 10:06:11 +08:00
2024-02-23 16:38:43 +00:00
	ret = btrfs_init_fs_root(root, anon_dev ? *anon_dev : 0);
2011-06-13 11:28:50 -04:00
	if (ret)
		goto fail;
2008-11-17 20:42:26 -05:00
2015-01-02 18:45:16 +01:00
	path = btrfs_alloc_path();
	if (!path) {
		ret = -ENOMEM;
		goto fail;
	}
2015-01-02 19:36:14 +01:00
	key.objectid = BTRFS_ORPHAN_OBJECTID;
	key.type = BTRFS_ORPHAN_ITEM_KEY;
2020-05-15 19:35:55 +02:00
	key.offset = objectid;
2015-01-02 19:36:14 +01:00
	ret = btrfs_search_slot(NULL, fs_info->tree_root, &key, path, 0, 0);
2015-01-02 18:45:16 +01:00
	btrfs_free_path(path);
2010-05-16 10:49:58 -04:00
	if (ret < 0)
		goto fail;
	if (ret == 0)
2014-04-02 19:51:05 +08:00
		set_bit(BTRFS_ROOT_ORPHAN_ITEM_INSERTED, &root->state);
2010-05-16 10:49:58 -04:00
2013-05-15 07:48:19 +00:00
	ret = btrfs_insert_fs_root(fs_info, root);
2007-04-09 10:42:37 -04:00
	if (ret) {
2022-03-24 06:44:54 -07:00
		if (ret == -EEXIST) {
			btrfs_put_root(root);
2009-09-21 15:56:00 -04:00
			goto again;
2022-03-24 06:44:54 -07:00
		}
2009-09-21 15:56:00 -04:00
		goto fail;
2007-04-09 10:42:37 -04:00
	}
2007-12-21 16:27:24 -05:00
	return root;
2009-09-21 15:56:00 -04:00
fail:
btrfs: fix double free of anon_dev after failure to create subvolume
When creating a subvolume, at create_subvol(), we allocate an anonymous
device and later call btrfs_get_new_fs_root(), which in turn just calls
btrfs_get_root_ref(). There we call btrfs_init_fs_root(), which assigns
the anonymous device to the root, but if there's an error after that call,
we jump to the 'fail' label and call btrfs_put_root(), which frees the
anonymous device and then returns an error that is propagated back to
create_subvol(). Then create_subvol() frees the anonymous device again.
When this happens, if the anonymous device was not reallocated after
the first time it was freed with btrfs_put_root(), we get a kernel
message like the following:
(...)
[13950.282466] BTRFS: error (device dm-0) in create_subvol:663: errno=-5 IO failure
[13950.283027] ida_free called for id=65 which is not allocated.
[13950.285974] BTRFS info (device dm-0): forced readonly
(...)
If the anonymous device gets reallocated by another btrfs filesystem
or any other kernel subsystem, then bad things can happen.
So fix this by setting the root's anonymous device to 0 at
btrfs_get_root_ref(), before we call btrfs_put_root(), if an error
happened.
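A minimal userspace sketch of this fix, assuming hypothetical sketch_* names in place of the btrfs functions: on the error path the callee clears the ID stored in the object before dropping the object, so the destructor does not free an ID that the caller still owns.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts how many times an ID was actually released (hypothetical). */
static int sketch_free_calls;

static void sketch_free_id(unsigned int id)
{
	if (id)
		sketch_free_calls++;
}

struct sketch_root {
	unsigned int anon_id;	/* 0 means "no ID attached" */
};

/* Destructor: releases whatever ID is still attached to the root. */
static void sketch_put_root(struct sketch_root *root)
{
	assert(root != NULL);
	sketch_free_id(root->anon_id);
	free(root);
}

/*
 * Returns NULL on failure. On failure the caller still owns @id and is
 * expected to free it, so the ID is detached before the root is dropped.
 */
static struct sketch_root *sketch_get_new_root(unsigned int id,
					       int fail_after_init)
{
	struct sketch_root *root = malloc(sizeof(*root));

	if (!root)
		return NULL;
	root->anon_id = id;
	if (fail_after_init) {
		root->anon_id = 0;	/* avoid a free by the destructor */
		sketch_put_root(root);
		return NULL;
	}
	return root;
}
```

On success the root keeps the ID and the destructor releases it later. On failure exactly one free happens, in the caller.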
Fixes: 2dfb1e43f57dd3 ("btrfs: preallocate anon block device at first phase of snapshot creation")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-10 19:02:18 +00:00
	/*
	 * If our caller provided us an anonymous device, then it's his
2022-05-25 16:27:25 +02:00
	 * responsibility to free it in case we fail. So we have to set our
2021-12-10 19:02:18 +00:00
	 * root's anon_dev to 0 to avoid a double free, once by btrfs_put_root()
	 * and once again by our caller.
	 */
2024-02-23 16:38:43 +00:00
	if (anon_dev && *anon_dev)
2021-12-10 19:02:18 +00:00
		root->anon_dev = 0;
2020-02-14 16:11:42 -05:00
	btrfs_put_root(root);
2009-09-21 15:56:00 -04:00
	return ERR_PTR(ret);
2007-12-21 16:27:24 -05:00
}
2020-06-16 10:17:36 +08:00
/*
 * Get in-memory reference of a root structure
 *
 * @objectid:	tree objectid
 * @check_ref:	if set, verify that the tree exists and the item has at least
 *		one reference
 */
struct btrfs_root *btrfs_get_fs_root(struct btrfs_fs_info *fs_info,
				     u64 objectid, bool check_ref)
{
btrfs: fix double free of anonymous device after snapshot creation failure
When creating a snapshot we may do a double free of an anonymous device
in case there's an error committing the transaction. The second free may
result in freeing an anonymous device number that was allocated by some
other subsystem in the kernel or another btrfs filesystem.
The steps that lead to this:
1) At ioctl.c:create_snapshot() we allocate an anonymous device number
and assign it to pending_snapshot->anon_dev;
2) Then we call btrfs_commit_transaction() and end up at
transaction.c:create_pending_snapshot();
3) There we call btrfs_get_new_fs_root() and pass it the anonymous device
number stored in pending_snapshot->anon_dev;
4) btrfs_get_new_fs_root() frees that anonymous device number because
btrfs_lookup_fs_root() returned a root - someone else did a lookup
of the new root already, which could some task doing backref walking;
5) After that some error happens in the transaction commit path, and at
ioctl.c:create_snapshot() we jump to the 'fail' label, and after
that we free again the same anonymous device number, which in the
meanwhile may have been reallocated somewhere else, because
pending_snapshot->anon_dev still has the same value as in step 1.
Recently syzbot ran into this and reported the following trace:
------------[ cut here ]------------
ida_free called for id=51 which is not allocated.
WARNING: CPU: 1 PID: 31038 at lib/idr.c:525 ida_free+0x370/0x420 lib/idr.c:525
Modules linked in:
CPU: 1 PID: 31038 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00410-gc02197fc9076 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
RIP: 0010:ida_free+0x370/0x420 lib/idr.c:525
Code: 10 42 80 3c 28 (...)
RSP: 0018:ffffc90015a67300 EFLAGS: 00010246
RAX: be5130472f5dd000 RBX: 0000000000000033 RCX: 0000000000040000
RDX: ffffc90009a7a000 RSI: 000000000003ffff RDI: 0000000000040000
RBP: ffffc90015a673f0 R08: ffffffff81577992 R09: 1ffff92002b4cdb4
R10: dffffc0000000000 R11: fffff52002b4cdb5 R12: 0000000000000246
R13: dffffc0000000000 R14: ffffffff8e256b80 R15: 0000000000000246
FS: 00007fca3f4b46c0(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f167a17b978 CR3: 000000001ed26000 CR4: 0000000000350ef0
Call Trace:
<TASK>
btrfs_get_root_ref+0xa48/0xaf0 fs/btrfs/disk-io.c:1346
create_pending_snapshot+0xff2/0x2bc0 fs/btrfs/transaction.c:1837
create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1931
btrfs_commit_transaction+0xf1c/0x3740 fs/btrfs/transaction.c:2404
create_snapshot+0x507/0x880 fs/btrfs/ioctl.c:848
btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:998
btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1044
__btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1306
btrfs_ioctl_snap_create_v2+0x1ca/0x400 fs/btrfs/ioctl.c:1393
btrfs_ioctl+0xa74/0xd40
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:871 [inline]
__se_sys_ioctl+0xfe/0x170 fs/ioctl.c:857
do_syscall_64+0xfb/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77
RIP: 0033:0x7fca3e67dda9
Code: 28 00 00 00 (...)
RSP: 002b:00007fca3f4b40c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fca3e7abf80 RCX: 00007fca3e67dda9
RDX: 00000000200005c0 RSI: 0000000050009417 RDI: 0000000000000003
RBP: 00007fca3e6ca47a R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000000b R14: 00007fca3e7abf80 R15: 00007fff6bf95658
</TASK>
Here we get an explicit warning about an attempt to free an anonymous
device number that is not currently allocated. It happens in a different
code path from the steps listed above, at btrfs_get_root_ref(), so this
change may not fix the case triggered by syzbot.
To fix at least the code path described in the steps above, change
btrfs_get_root_ref() and its callers to receive a dev_t pointer argument
for the anonymous device number, so that when it frees the number it
also resets it to 0, and callers further up the call chain do not attempt
a second free.
CC: stable@vger.kernel.org # 5.10+
Link: https://lore.kernel.org/linux-btrfs/000000000000f673a1061202f630@google.com/
Fixes: e03ee2fe873e ("btrfs: do not ASSERT() if the newly created subvolume already got read")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
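The ownership hand-off described in this commit message can be sketched in plain userspace C. This is only an illustration under stated assumptions: the toy_* names and the small id table are hypothetical stand-ins for the anonymous device allocator and are not btrfs or kernel APIs. The point it tries to show is that a callee which frees the caller's id through a pointer can also zero it, so a later cleanup in the caller sees 0 and skips the free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the anonymous device number allocator. */
#define TOY_MAX_IDS 8
static bool toy_id_in_use[TOY_MAX_IDS];

/* Allocate the lowest free id >= 1; returns 0 if none is free. */
static unsigned int toy_alloc_id(void)
{
	for (unsigned int id = 1; id < TOY_MAX_IDS; id++) {
		if (!toy_id_in_use[id]) {
			toy_id_in_use[id] = true;
			return id;
		}
	}
	return 0;
}

/* Free an id; returns false if the id is not currently allocated. */
static bool toy_free_id(unsigned int id)
{
	if (id == 0 || id >= TOY_MAX_IDS || !toy_id_in_use[id])
		return false;
	toy_id_in_use[id] = false;
	return true;
}

/*
 * Callee that may release the caller's id (here: when the root is already
 * cached). Because it receives a pointer, it can reset the caller's copy
 * to 0 right after freeing it.
 */
static void toy_get_root(unsigned int *id, bool root_already_cached)
{
	if (root_already_cached && id != NULL && *id != 0) {
		bool freed = toy_free_id(*id);

		assert(freed);
		(void)freed;
		*id = 0;
	}
}

/* Caller-side error path: free only an id that we still own. */
static bool toy_cleanup(unsigned int *id)
{
	bool freed;

	if (id == NULL || *id == 0)
		return true;	/* nothing left to free */
	freed = toy_free_id(*id);
	*id = 0;
	return freed;
}
```

In this sketch, calling toy_get_root(&id, true) followed by toy_cleanup(&id) frees the id exactly once, even if another user has been handed the same number in between.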
2024-02-23 16:38:43 +00:00
	return btrfs_get_root_ref(fs_info, objectid, NULL, check_ref);
2020-06-16 10:17:36 +08:00
}
/*
 * Get in-memory reference of a root structure, created as new, optionally pass
 * the anonymous block device id
 *
 * @objectid:	tree objectid
2024-02-23 16:38:43 +00:00
 * @anon_dev:	if NULL, allocate a new anonymous block device or use the
 *		parameter value if not NULL
2020-06-16 10:17:36 +08:00
*/
struct btrfs_root *btrfs_get_new_fs_root(struct btrfs_fs_info *fs_info,
2024-02-23 16:38:43 +00:00
					 u64 objectid, dev_t *anon_dev)
2020-06-16 10:17:36 +08:00
{
	return btrfs_get_root_ref(fs_info, objectid, anon_dev, true);
}
2020-10-19 16:02:31 -04:00
/*
2023-09-08 01:09:25 +02:00
 * Return a root for the given objectid.
*
2020-10-19 16:02:31 -04:00
 * @fs_info:	the fs_info
 * @objectid:	the objectid we need to lookup
 *
 * This is exclusively used for backref walking, and exists specifically because
 * of how qgroups does lookups. Qgroups will do a backref lookup at delayed ref
 * creation time, which means we may have to read the tree_root in order to look
 * up a fs root that is not in memory. If the root is not in memory we will
 * read the tree root commit root and look up the fs root from there. This is a
 * temporary root, it will not be inserted into the radix tree as it doesn't
 * have the most uptodate information, it'll simply be discarded once the
 * backref code is finished using the root.
 */
struct btrfs_root *btrfs_get_fs_root_commit_root(struct btrfs_fs_info *fs_info,
						 struct btrfs_path *path,
						 u64 objectid)
{
	struct btrfs_root *root;
	struct btrfs_key key;

	ASSERT(path->search_commit_root && path->skip_locking);

	/*
	 * This can return -ENOENT if we ask for a root that doesn't exist, but
	 * since this is called via the backref walking code we won't be looking
	 * up a root that doesn't exist, unless there's corruption. So if root
	 * != NULL just return it.
	 */
	root = btrfs_get_global_root(fs_info, objectid);
	if (root)
		return root;

	root = btrfs_lookup_fs_root(fs_info, objectid);
	if (root)
		return root;

	key.objectid = objectid;
	key.type = BTRFS_ROOT_ITEM_KEY;
	key.offset = (u64)-1;
	root = read_tree_root_path(fs_info->tree_root, path, &key);
	btrfs_release_path(path);

	return root;
}
2008-06-25 16:01:31 -04:00
static int cleaner_kthread(void *arg)
{
2022-03-31 03:34:08 -07:00
	struct btrfs_fs_info *fs_info = arg;
2013-05-14 10:20:40 +00:00
	int again;
2008-06-25 16:01:31 -04:00
Btrfs: fix missing delayed iputs on unmount
There's a race between close_ctree() and cleaner_kthread().
close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
sees it set, but this is racy; the cleaner might have already checked
the bit and could be cleaning stuff. In particular, if it deletes unused
block groups, it will create delayed iputs for the free space cache
inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
longer running delayed iputs after a commit. Therefore, if the cleaner
creates more delayed iputs after delayed iputs are run in
btrfs_commit_super(), we will leak inodes on unmount and get a busy
inode crash from the VFS.
Fix it by parking the cleaner before we actually close anything. Then,
any remaining delayed iputs will always be handled in
btrfs_commit_super(). This also ensures that the commit in close_ctree()
is really the last commit, so we can get rid of the commit in
cleaner_kthread().
The fstest/generic/475 followed by 476 can trigger a crash that
manifests as a slab corruption caused by accessing the freed kthread
structure by a wake up function. Sample trace:
[ 5657.077612] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
[ 5657.079432] PGD 1c57a067 P4D 1c57a067 PUD da10067 PMD 0
[ 5657.080661] Oops: 0000 [#1] PREEMPT SMP
[ 5657.081592] CPU: 1 PID: 5157 Comm: fsstress Tainted: G W 4.19.0-rc8-default+ #323
[ 5657.083703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
[ 5657.086577] RIP: 0010:shrink_page_list+0x2f9/0xe90
[ 5657.091937] RSP: 0018:ffffb5c745c8f728 EFLAGS: 00010287
[ 5657.092953] RAX: 0000000000000074 RBX: ffffb5c745c8f830 RCX: 0000000000000000
[ 5657.094590] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a8747fdf3d0
[ 5657.095987] RBP: ffffb5c745c8f9e0 R08: 0000000000000000 R09: 0000000000000000
[ 5657.097159] R10: ffff9a8747fdf5e8 R11: 0000000000000000 R12: ffffb5c745c8f788
[ 5657.098513] R13: ffff9a877f6ff2c0 R14: ffff9a877f6ff2c8 R15: dead000000000200
[ 5657.099689] FS: 00007f948d853b80(0000) GS:ffff9a877d600000(0000) knlGS:0000000000000000
[ 5657.101032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5657.101953] CR2: 00000000000000cc CR3: 00000000684bd000 CR4: 00000000000006e0
[ 5657.103159] Call Trace:
[ 5657.103776] shrink_inactive_list+0x194/0x410
[ 5657.104671] shrink_node_memcg.constprop.84+0x39a/0x6a0
[ 5657.105750] shrink_node+0x62/0x1c0
[ 5657.106529] try_to_free_pages+0x1a4/0x500
[ 5657.107408] __alloc_pages_slowpath+0x2c9/0xb20
[ 5657.108418] __alloc_pages_nodemask+0x268/0x2b0
[ 5657.109348] kmalloc_large_node+0x37/0x90
[ 5657.110205] __kmalloc_node+0x236/0x310
[ 5657.111014] kvmalloc_node+0x3e/0x70
Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add trace ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-31 10:06:08 -07:00
	while (1) {
2013-05-14 10:20:40 +00:00
		again = 0;
2008-06-25 16:01:31 -04:00
2019-01-11 10:21:02 -05:00
		set_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags);
2013-05-14 10:20:40 +00:00
		/* Make the cleaner go to sleep early. */
2016-06-22 18:54:24 -04:00
		if (btrfs_need_cleaner_sleep(fs_info))
2013-05-14 10:20:40 +00:00
			goto sleep;
2016-06-12 23:39:58 -04:00
		/*
		 * Do not do anything if we might cause open_ctree() to block
		 * before we have finished mounting the filesystem.
		 */
2016-06-22 18:54:23 -04:00
		if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
2016-06-12 23:39:58 -04:00
			goto sleep;
2016-06-22 18:54:23 -04:00
		if (!mutex_trylock(&fs_info->cleaner_mutex))
2013-05-14 10:20:40 +00:00
			goto sleep;
2013-05-14 10:20:42 +00:00
		/*
		 * Avoid the problem that we change the status of the fs
		 * during the above check and trylock.
		 */
2016-06-22 18:54:24 -04:00
		if (btrfs_need_cleaner_sleep(fs_info)) {
2016-06-22 18:54:23 -04:00
			mutex_unlock(&fs_info->cleaner_mutex);
2013-05-14 10:20:42 +00:00
			goto sleep;
2009-09-21 16:00:26 -04:00
		}
2008-06-25 16:01:31 -04:00
btrfs: sysfs: update fs features directory asynchronously
[BUG]
Since the introduction of per-fs feature sysfs interface
(/sys/fs/btrfs/<UUID>/features/), the content of that directory is never
updated.
Thus for the following case, that directory will not show the new
features like RAID56:
# mkfs.btrfs -f $dev1 $dev2 $dev3
# mount $dev1 $mnt
# btrfs balance start -f -mconvert=raid5 $mnt
# ls /sys/fs/btrfs/$uuid/features/
extended_iref free_space_tree no_holes skinny_metadata
While after unmount and mount, we got the correct features:
# umount $mnt
# mount $dev1 $mnt
# ls /sys/fs/btrfs/$uuid/features/
extended_iref free_space_tree no_holes raid56 skinny_metadata
[CAUSE]
Because we never really try to update the content of the per-fs features/
directory.
We had an attempt to update the features directory dynamically in commit
14e46e04958d ("btrfs: synchronize incompat feature bits with sysfs
files"), but unfortunately it got reverted in commit e410e34fad91
("Revert "btrfs: synchronize incompat feature bits with sysfs files"").
The problem with the original patch is that, in the context of
btrfs_create_chunk(), we cannot afford to update the sysfs group.
The exported but never utilized function btrfs_sysfs_feature_update()
is the leftover of that attempt. Even if we call sysfs_update_group(),
new files need extra memory allocation, and we have no way to make the
sysfs update use GFP_NOFS.
[FIX]
This patch will address the old problem by doing asynchronous sysfs
update in the cleaner thread.
This involves the following changes:
- Make the __btrfs_(set|clear)_fs_(incompat|compat_ro) helpers set the
  BTRFS_FS_FEATURE_CHANGED flag when needed
- Update btrfs_sysfs_feature_update() to use sysfs_update_group()
  and drop unnecessary arguments
- Call btrfs_sysfs_feature_update() in cleaner_kthread
  if the BTRFS_FS_FEATURE_CHANGED flag is set
- Wake up cleaner_kthread in btrfs_commit_transaction if the
  BTRFS_FS_FEATURE_CHANGED flag is set
With this, all the previously dangerous call sites like
btrfs_create_chunk() need no new changes, as the above helpers will
have already set the BTRFS_FS_FEATURE_CHANGED flag.
The real work happens in cleaner_kthread, so we pay the cost of
delaying the update to the sysfs directory. The delay should be small
enough that the end user cannot notice it, though it may grow if the
cleaner thread is busy removing subvolumes or running defrag.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
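The "set a flag now, do the expensive update later in a worker" pattern from this commit message can be sketched with C11 atomics. This is a minimal single-threaded illustration, not the btrfs implementation: struct toy_fs, TOY_FEATURE_CHANGED and the toy_* helpers are hypothetical names, and the "publish" step merely copies a bitmask where the real code would update sysfs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit, standing in for BTRFS_FS_FEATURE_CHANGED. */
#define TOY_FEATURE_CHANGED (1UL << 0)

struct toy_fs {
	atomic_ulong flags;		/* pending-work flags */
	unsigned long features;		/* current feature bits */
	unsigned long published;	/* what the slow path last exported */
	unsigned int publish_count;	/* how often the slow path ran */
};

/* Fast path: record the feature and only mark that an update is pending. */
static void toy_set_feature(struct toy_fs *fs, unsigned long bit)
{
	assert(bit != 0);
	fs->features |= bit;
	atomic_fetch_or(&fs->flags, TOY_FEATURE_CHANGED);
}

/* Atomically clear a flag and report whether it was set before. */
static bool toy_test_and_clear(struct toy_fs *fs, unsigned long flag)
{
	unsigned long old = atomic_fetch_and(&fs->flags, ~flag);

	return (old & flag) != 0;
}

/* One pass of the worker: do the expensive update only when flagged. */
static void toy_worker_pass(struct toy_fs *fs)
{
	if (toy_test_and_clear(fs, TOY_FEATURE_CHANGED)) {
		fs->published = fs->features;
		fs->publish_count++;
	}
}
```

Several toy_set_feature() calls between two worker passes collapse into a single publish, which is the trade-off the message describes: the update is delayed, but the restricted context never performs it.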
2023-01-13 19:11:39 +08:00
		if (test_and_clear_bit(BTRFS_FS_FEATURE_CHANGED, &fs_info->flags))
			btrfs_sysfs_feature_update(fs_info);
2016-06-22 18:54:24 -04:00
		btrfs_run_delayed_iputs(fs_info);
Btrfs: fix deadlock running delayed iputs at transaction commit time
While running a stress test I ran into a deadlock when running the delayed
iputs at transaction time, which produced the following report and trace:
[ 886.399989] =============================================
[ 886.400871] [ INFO: possible recursive locking detected ]
[ 886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted
[ 886.402384] ---------------------------------------------
[ 886.403182] fio/8277 is trying to acquire lock:
[ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.403568]
[ 886.403568] but task is already holding lock:
[ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.403568]
[ 886.403568] other info that might help us debug this:
[ 886.403568] Possible unsafe locking scenario:
[ 886.403568]
[ 886.403568] CPU0
[ 886.403568] ----
[ 886.403568] lock(&fs_info->delayed_iput_sem);
[ 886.403568] lock(&fs_info->delayed_iput_sem);
[ 886.403568]
[ 886.403568] *** DEADLOCK ***
[ 886.403568]
[ 886.403568] May be due to missing lock nesting notation
[ 886.403568]
[ 886.403568] 3 locks held by fio/8277:
[ 886.403568] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0
[ 886.403568] #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs]
[ 886.403568] #2: (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.403568]
[ 886.403568] stack backtrace:
[ 886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 886.403568] 0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0
[ 886.403568] ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290
[ 886.403568] 0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804
[ 886.403568] Call Trace:
[ 886.403568] [<ffffffff8125d4fd>] dump_stack+0x4e/0x79
[ 886.403568] [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b
[ 886.403568] [<ffffffff810c22db>] ? __module_address+0xdf/0x108
[ 886.403568] [<ffffffff8108eb77>] lock_acquire+0x10d/0x194
[ 886.403568] [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194
[ 886.403568] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.489542] [<ffffffff8148556b>] down_read+0x3e/0x4d
[ 886.489542] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.489542] [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
[ 886.489542] [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs]
[ 886.489542] [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs]
[ 886.489542] [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs]
[ 886.489542] [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs]
[ 886.489542] [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs]
[ 886.489542] [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs]
[ 886.489542] [<ffffffff81188e31>] evict+0xa7/0x15c
[ 886.489542] [<ffffffff81189878>] iput+0x1d3/0x266
[ 886.489542] [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs]
[ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
[ 886.489542] [<ffffffff81085096>] ? signal_pending_state+0x31/0x31
[ 886.489542] [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs]
[ 886.489542] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
[ 886.489542] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
[ 886.489542] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
[ 886.489542] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
[ 886.489542] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
[ 886.489542] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
[ 886.489542] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
[ 886.489542] [<ffffffff81172cda>] vfs_write+0xa0/0xe4
[ 886.489542] [<ffffffff811734cc>] SyS_write+0x50/0x7e
[ 886.489542] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds.
[ 1081.854348] Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1081.863227] fio D ffff880213f9bb28 0 8244 8240 0x00000000
[ 1081.868719] ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240
[ 1081.872499] ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400
[ 1081.876834] ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4
[ 1081.880782] Call Trace:
[ 1081.881793] [<ffffffff81482ba4>] schedule+0x7f/0x97
[ 1081.883340] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
[ 1081.895525] [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab
[ 1081.897419] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
[ 1081.899251] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
[ 1081.901063] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
[ 1081.902365] [<ffffffff814855bd>] down_write+0x43/0x57
[ 1081.903846] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1081.906078] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1081.908846] [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c
[ 1081.910409] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
[ 1081.912482] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
[ 1081.914597] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
[ 1081.919037] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
[ 1081.920754] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
[ 1081.922496] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
[ 1081.923922] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
[ 1081.925275] [<ffffffff81172cda>] vfs_write+0xa0/0xe4
[ 1081.926584] [<ffffffff811734cc>] SyS_write+0x50/0x7e
[ 1081.927968] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 1081.985293] INFO: lockdep is turned off.
[ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds.
[ 1081.987434] Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1081.990147] fio D ffff880218febbb8 0 8249 8240 0x00000000
[ 1081.991626] ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240
[ 1081.993258] ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00
[ 1081.994850] ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4
[ 1081.996485] Call Trace:
[ 1081.997037] [<ffffffff81482ba4>] schedule+0x7f/0x97
[ 1081.998017] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
[ 1081.999241] [<ffffffff810852a5>] ? finish_wait+0x6d/0x76
[ 1082.000306] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
[ 1082.001533] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
[ 1082.002776] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
[ 1082.003995] [<ffffffff814855bd>] down_write+0x43/0x57
[ 1082.005000] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1082.007403] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1082.008988] [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs]
[ 1082.010193] [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77
[ 1082.011280] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
[ 1082.012265] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
[ 1082.013021] [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff
[ 1082.013738] [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b
[ 1082.014778] [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea
[ 1082.015778] [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e
[ 1082.016806] [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71
[ 1082.017789] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
[ 1082.018706] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
This happens because we can recursively acquire the semaphore
fs_info->delayed_iput_sem when attempting to allocate space to satisfy
a file write request as shown in the first trace above - when committing
a transaction we acquire (down_read) the semaphore before running the
delayed iputs, and when running a delayed iput() we can end up calling
an inode's eviction handler, which in turn commits another transaction
and attempts to acquire (down_read) again the semaphore to run more
delayed iput operations.
This results in a deadlock because a task that acquires a semaphore
multiple times should invoke down_read_nested() with a different lockdep
class for each level of recursion.
Fix this by simplifying the implementation and using a mutex instead,
which is acquired by the cleaner kthread before it runs the delayed iputs,
instead of always acquiring a semaphore before delayed references are
run from anywhere.
Fixes: d7c151717a1e (btrfs: Fix NO_SPACE bug caused by delayed-iput)
Cc: stable@vger.kernel.org # 4.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-15 11:05:12 +00:00
2022-02-18 14:56:11 -05:00
		again = btrfs_clean_one_deleted_snapshot(fs_info);
2016-06-22 18:54:23 -04:00
		mutex_unlock(&fs_info->cleaner_mutex);
2013-05-14 10:20:40 +00:00
		/*
2013-05-14 10:20:41 +00:00
		 * The defragger has dealt with the R/O remount and umount,
		 * needn't do anything special here.
2013-05-14 10:20:40 +00:00
		 */
2016-06-22 18:54:23 -04:00
		btrfs_run_defrag_inodes(fs_info);
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this is:

           CPU 0                                        CPU 1

 btrfs_balance()
   btrfs_relocate_chunk()
     btrfs_relocate_block_group(bg X)
       btrfs_lookup_block_group(bg X)

                                             cleaner_kthread
                                               locks fs_info->cleaner_mutex

                                               btrfs_delete_unused_bgs()
                                                 finds bg X, which became
                                                 unused in the previous
                                                 transaction

                                                 checks bg X ->ro == 0,
                                                 so it proceeds

       sets bg X ->ro to 1
       (btrfs_set_block_group_ro(bg X))

       blocks on fs_info->cleaner_mutex

                                                 btrfs_remove_chunk(bg X)

                                               unlocks fs_info->cleaner_mutex

       acquires fs_info->cleaner_mutex
       relocate_block_group()
         --> does nothing, no extents found in
             the extent tree from bg X
       unlocks fs_info->cleaner_mutex

   btrfs_relocate_block_group(bg X) returns

   btrfs_remove_chunk(bg X)
     extent map not found
       --> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
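The idea of serializing the two paths can be illustrated with a small userspace sketch. This is not the btrfs code: `struct chunk`, `chunk_lock`, `remove_chunk_serialized()` and `run_removal_demo()` are hypothetical names made up for the illustration, and POSIX threads stand in for the cleaner kthread and the balance task. The point it shows is that when the lookup and the removal happen inside one critical section, only one of the two tasks ever performs the removal.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a block group's chunk; not a btrfs type. */
struct chunk {
	bool mapped;       /* plays the role of "extent map exists" */
	int  remove_calls; /* how many times removal actually ran */
};

/* Hypothetical lock playing the role of the new mutex from the fix. */
static pthread_mutex_t chunk_lock = PTHREAD_MUTEX_INITIALIZER;

/* Check for the chunk and remove it as a single critical section. */
static void remove_chunk_serialized(struct chunk *c)
{
	pthread_mutex_lock(&chunk_lock);
	if (c->mapped) {
		c->mapped = false;
		c->remove_calls++;
	}
	pthread_mutex_unlock(&chunk_lock);
}

static void *worker(void *arg)
{
	remove_chunk_serialized(arg);
	return NULL;
}

/*
 * Runs a "cleaner" and a "balance" thread against the same chunk.
 * Returns the number of removals performed (1 means no double removal),
 * or -1 if a thread could not be created.
 */
int run_removal_demo(void)
{
	struct chunk c = { .mapped = true, .remove_calls = 0 };
	pthread_t cleaner, balance;

	if (pthread_create(&cleaner, NULL, worker, &c) != 0)
		return -1;
	if (pthread_create(&balance, NULL, worker, &c) != 0) {
		pthread_join(cleaner, NULL);
		return -1;
	}
	pthread_join(cleaner, NULL);
	pthread_join(balance, NULL);
	return c.remove_calls;
}
```

Calling `run_removal_demo()` from a `main()` and checking that it returns 1 confirms that the second task finds the chunk already gone and skips the removal, instead of tripping over a missing mapping as in the trace above.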
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 00:58:53 +01:00
/*
2021-04-19 16:41:01 +09:00
* Acquires fs_info->reclaim_bgs_lock to avoid racing
2015-06-11 00:58:53 +01:00
* with relocation (btrfs_relocate_chunk) and relocation
* acquires fs_info->cleaner_mutex (btrfs_relocate_block_group)
2021-04-19 16:41:01 +09:00
* after acquiring fs_info->reclaim_bgs_lock. So we
2015-06-11 00:58:53 +01:00
* can't hold, nor need to, fs_info->cleaner_mutex when deleting
* unused block groups.
*/
2016-06-22 18:54:23 -04:00
btrfs_delete_unused_bgs(fs_info);
2021-04-19 16:41:02 +09:00
/*
* Reclaim block groups in the reclaim_bgs list after we deleted
* all unused block_groups. This possibly gives us some more free
* space.
*/
btrfs_reclaim_bgs(fs_info);
2013-05-14 10:20:40 +00:00
sleep:
btrfs: fix race between RO remount and the cleaner task
When we are remounting a filesystem in RO mode we can race with the cleaner
task and result in leaking a transaction if the filesystem is unmounted
shortly after, before the transaction kthread had a chance to commit that
transaction. That also results in a crash during unmount, due to a
use-after-free, if hardware acceleration is not available for crc32c.
The following sequence of steps explains how the race happens.
1) The filesystem is mounted in RW mode and the cleaner task is running.
This means that currently BTRFS_FS_CLEANER_RUNNING is set at
fs_info->flags;
2) The cleaner task is currently running delayed iputs for example;
3) A filesystem RO remount operation starts;
4) The RO remount task calls btrfs_commit_super(), which commits any
currently open transaction, and it finishes;
5) At this point the cleaner task is still running and it creates a new
transaction by doing one of the following things:
* When running the delayed iput() for an inode with a 0 link count,
in which case at btrfs_evict_inode() we start a transaction through
the call to evict_refill_and_join(), use it and then release its
handle through btrfs_end_transaction();
* When deleting a dead root through btrfs_clean_one_deleted_snapshot(),
a transaction is started at btrfs_drop_snapshot() and then its handle
is released through a call to btrfs_end_transaction_throttle();
* When the remount task was still running, and before the remount task
called btrfs_delete_unused_bgs(), the cleaner task also called
btrfs_delete_unused_bgs() and it picked and removed one block group
from the list of unused block groups. Before the cleaner task started
a transaction, through btrfs_start_trans_remove_block_group() at
btrfs_delete_unused_bgs(), the remount task had already called
btrfs_commit_super();
6) So at this point the filesystem is in RO mode and we have an open
transaction that was started by the cleaner task;
7) Shortly after a filesystem unmount operation starts. At close_ctree()
we stop the transaction kthread before it had a chance to commit the
transaction, since less than 30 seconds (the default commit interval)
have elapsed since the last transaction was committed;
8) We end up calling iput() against the btree inode at close_ctree() while
there is an open transaction, and since that transaction was used to
update btrees by the cleaner, we have dirty pages in the btree inode
due to COW operations on metadata extents, and therefore writeback is
triggered for the btree inode.
So btree_write_cache_pages() is invoked to flush those dirty pages
during the final iput() on the btree inode. This results in creating a
bio and submitting it, which makes us end up at
btrfs_submit_metadata_bio();
9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
that calls btrfs_wq_submit_bio(), because check_async_write() returned
a value of 1. This value of 1 is because we did not have hardware
acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
set in fs_info->flags;
10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
workqueue at fs_info->workers, which was already freed before by the
call to btrfs_stop_all_workers() at close_ctree(). This results in an
invalid memory access due to a use-after-free, leading to a crash.
When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:
------------[ cut here ]------------
WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
Code: f0 01 00 00 48 39 c2 75 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 48 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c6 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Code: 48 83 bb b0 03 00 00 00 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c7 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Code: ad de 49 be 22 01 00 (...)
RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c8 ]---
BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:
stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
Code: 54 55 53 48 89 f3 (...)
RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
submit_one_bio+0x61/0x70 [btrfs]
btree_write_cache_pages+0x414/0x450 [btrfs]
? kobject_put+0x9a/0x1d0
? trace_hardirqs_on+0x1b/0xf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
? free_debug_processing+0x1e1/0x2b0
do_writepages+0x43/0xe0
? lock_acquired+0x199/0x490
__writeback_single_inode+0x59/0x650
writeback_single_inode+0xaf/0x120
write_inode_now+0x94/0xd0
iput+0x187/0x2b0
close_ctree+0x2c6/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3cfebabee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
---[ end trace dd74718fef1ed5cc ]---
Finally when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, consequence of the transaction that was never committed and got
leaked:
=============================================================================
BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x0000000050cbdd61 @offset=12104
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x0000000086e9b0ff @offset=12776
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
commit_cowonly_roots+0x248/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000001a340018 @offset=4408
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_commit_transaction+0x60/0xc40 [btrfs]
create_subvol+0x56a/0x990 [btrfs]
btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
btrfs_ioctl+0x1a92/0x36f0 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x000000002b46292a @offset=13648
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? __mutex_unlock_slowpath+0x45/0x2a0
kmem_cache_destroy+0x55/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000004cf95ea8 @offset=6264
INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1
So fix this by making the remount path wait for the cleaner task before
calling btrfs_commit_super(). The remount path now waits for the bit
BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling
btrfs_commit_super() and this ensures the cleaner can not start a
transaction after that, because it sleeps when the filesystem is in RO
mode and we have already flagged the filesystem as RO before waiting for
BTRFS_FS_CLEANER_RUNNING to be cleared.
This also introduces a new flag BTRFS_FS_STATE_RO to be used for
fs_info->fs_state when the filesystem is in RO mode. This is because we
were doing the RO check using the flags of the superblock and setting the
RO mode simply by ORing into the superblock's flags - those operations are
not atomic and could result in the cleaner not seeing the update from the
remount task after it clears BTRFS_FS_CLEANER_RUNNING.
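The difference between a plain OR into a flags word and an atomic bit
operation can be sketched in userspace C. This is only an analogy using C11
<stdatomic.h>; the kernel uses its own set_bit()/test_bit() helpers, and all
names below (state_set_bit, cleaner_may_run, the enum values) are
hypothetical and not taken from the btrfs sources.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers, loosely named after the flags in the text. */
enum { STATE_RO = 0, CLEANER_RUNNING = 1 };

/* Atomically set one bit; concurrent updates to other bits are not lost. */
static void state_set_bit(atomic_ulong *word, unsigned int bit)
{
	assert(bit < 32);
	atomic_fetch_or(word, 1UL << bit);
}

/* Atomically clear one bit. */
static void state_clear_bit(atomic_ulong *word, unsigned int bit)
{
	assert(bit < 32);
	atomic_fetch_and(word, ~(1UL << bit));
}

/* Read the current value of one bit. */
static bool state_test_bit(atomic_ulong *word, unsigned int bit)
{
	assert(bit < 32);
	return (atomic_load(word) >> bit) & 1UL;
}

/* Assumed cleaner-side check: do no work while the RO bit is set. */
static bool cleaner_may_run(atomic_ulong *fs_state)
{
	return !state_test_bit(fs_state, STATE_RO);
}
```

With a non-atomic read-modify-write, two tasks updating different bits of
the same word can overwrite each other's change; the fetch-or/fetch-and
forms above avoid that.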
Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-14 10:10:47 +00:00
clear_and_wake_up_bit ( BTRFS_FS_CLEANER_RUNNING , & fs_info - > flags ) ;
Btrfs: fix missing delayed iputs on unmount
There's a race between close_ctree() and cleaner_kthread().
close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
sees it set, but this is racy; the cleaner might have already checked
the bit and could be cleaning stuff. In particular, if it deletes unused
block groups, it will create delayed iputs for the free space cache
inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
longer running delayed iputs after a commit. Therefore, if the cleaner
creates more delayed iputs after delayed iputs are run in
btrfs_commit_super(), we will leak inodes on unmount and get a busy
inode crash from the VFS.
Fix it by parking the cleaner before we actually close anything. Then,
any remaining delayed iputs will always be handled in
btrfs_commit_super(). This also ensures that the commit in close_ctree()
is really the last commit, so we can get rid of the commit in
cleaner_kthread().
Running fstests generic/475 followed by generic/476 can trigger a crash that
manifests as slab corruption, caused by a wake-up function accessing the
freed kthread structure. Sample trace:
[ 5657.077612] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
[ 5657.079432] PGD 1c57a067 P4D 1c57a067 PUD da10067 PMD 0
[ 5657.080661] Oops: 0000 [#1] PREEMPT SMP
[ 5657.081592] CPU: 1 PID: 5157 Comm: fsstress Tainted: G W 4.19.0-rc8-default+ #323
[ 5657.083703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
[ 5657.086577] RIP: 0010:shrink_page_list+0x2f9/0xe90
[ 5657.091937] RSP: 0018:ffffb5c745c8f728 EFLAGS: 00010287
[ 5657.092953] RAX: 0000000000000074 RBX: ffffb5c745c8f830 RCX: 0000000000000000
[ 5657.094590] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a8747fdf3d0
[ 5657.095987] RBP: ffffb5c745c8f9e0 R08: 0000000000000000 R09: 0000000000000000
[ 5657.097159] R10: ffff9a8747fdf5e8 R11: 0000000000000000 R12: ffffb5c745c8f788
[ 5657.098513] R13: ffff9a877f6ff2c0 R14: ffff9a877f6ff2c8 R15: dead000000000200
[ 5657.099689] FS: 00007f948d853b80(0000) GS:ffff9a877d600000(0000) knlGS:0000000000000000
[ 5657.101032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5657.101953] CR2: 00000000000000cc CR3: 00000000684bd000 CR4: 00000000000006e0
[ 5657.103159] Call Trace:
[ 5657.103776] shrink_inactive_list+0x194/0x410
[ 5657.104671] shrink_node_memcg.constprop.84+0x39a/0x6a0
[ 5657.105750] shrink_node+0x62/0x1c0
[ 5657.106529] try_to_free_pages+0x1a4/0x500
[ 5657.107408] __alloc_pages_slowpath+0x2c9/0xb20
[ 5657.108418] __alloc_pages_nodemask+0x268/0x2b0
[ 5657.109348] kmalloc_large_node+0x37/0x90
[ 5657.110205] __kmalloc_node+0x236/0x310
[ 5657.111014] kvmalloc_node+0x3e/0x70
Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add trace ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-31 10:06:08 -07:00
if ( kthread_should_park ( ) )
kthread_parkme ( ) ;
if ( kthread_should_stop ( ) )
return 0 ;
2016-03-15 11:28:54 +01:00
if ( ! again ) {
2008-06-25 16:01:31 -04:00
set_current_state ( TASK_INTERRUPTIBLE ) ;
2018-10-31 10:06:08 -07:00
schedule ( ) ;
2008-06-25 16:01:31 -04:00
__set_current_state ( TASK_RUNNING ) ;
}
Btrfs: fix crash on close_ctree() if cleaner starts new transaction
Often when running fstests btrfs/079 I was running into the following
trace during umount on one of my qemu/kvm test vms:
[ 8245.682441] WARNING: CPU: 8 PID: 25064 at fs/btrfs/extent-tree.c:138 btrfs_put_block_group+0x51/0x69 [btrfs]()
[ 8245.685039] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 acpi_cpufreq processor psmouse i2c_core thermal_sys parport evdev serio_raw button pcspkr microcode ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata floppy virtio_pci virtio_ring scsi_mod virtio e1000 [last unloaded: btrfs]
[ 8245.693860] CPU: 8 PID: 25064 Comm: umount Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[ 8245.695081] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[ 8245.697583] 0000000000000009 ffff88020d047ce8 ffffffff8145eec7 ffffffff81095dce
[ 8245.699234] 0000000000000000 ffff88020d047d28 ffffffff8104b399 0000000000000028
[ 8245.700995] ffffffffa04db07b ffff8801c6036c00 ffff8801c6036d68 ffff880202eb40b0
[ 8245.702510] Call Trace:
[ 8245.703006] [<ffffffff8145eec7>] dump_stack+0x4f/0x7b
[ 8245.705393] [<ffffffff81095dce>] ? console_unlock+0x356/0x3a2
[ 8245.706569] [<ffffffff8104b399>] warn_slowpath_common+0xa1/0xbb
[ 8245.707747] [<ffffffffa04db07b>] ? btrfs_put_block_group+0x51/0x69 [btrfs]
[ 8245.709101] [<ffffffff8104b456>] warn_slowpath_null+0x1a/0x1c
[ 8245.710274] [<ffffffffa04db07b>] btrfs_put_block_group+0x51/0x69 [btrfs]
[ 8245.711823] [<ffffffffa04e3473>] btrfs_free_block_groups+0x145/0x322 [btrfs]
[ 8245.713251] [<ffffffffa04ef31a>] close_ctree+0x1ef/0x325 [btrfs]
[ 8245.714448] [<ffffffff8117d26e>] ? evict_inodes+0xdc/0xeb
[ 8245.715539] [<ffffffffa04cb3ad>] btrfs_put_super+0x19/0x1b [btrfs]
[ 8245.716835] [<ffffffff81167607>] generic_shutdown_super+0x73/0xef
[ 8245.718015] [<ffffffff81167a3a>] kill_anon_super+0x13/0x1e
[ 8245.719101] [<ffffffffa04cb1b6>] btrfs_kill_super+0x17/0x23 [btrfs]
[ 8245.720316] [<ffffffff81167544>] deactivate_locked_super+0x3b/0x68
[ 8245.721517] [<ffffffff81167dd6>] deactivate_super+0x3f/0x43
[ 8245.722581] [<ffffffff8117fbb9>] cleanup_mnt+0x59/0x78
[ 8245.723538] [<ffffffff8117fc18>] __cleanup_mnt+0x12/0x14
[ 8245.724572] [<ffffffff81065371>] task_work_run+0x8f/0xbc
[ 8245.725598] [<ffffffff810028fb>] do_notify_resume+0x45/0x53
[ 8245.726892] [<ffffffff814651ac>] int_signal+0x12/0x17
[ 8245.737887] ---[ end trace a01d038397e99b92 ]---
[ 8245.769363] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 8245.770737] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 acpi_cpufreq processor psmouse i2c_core thermal_sys parport evdev serio_raw button pcspkr microcode ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata floppy virtio_pci virtio_ring scsi_mod virtio e1000 [last unloaded: btrfs]
[ 8245.772641] CPU: 2 PID: 25064 Comm: umount Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[ 8245.772641] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[ 8245.772641] task: ffff880013005810 ti: ffff88020d044000 task.ti: ffff88020d044000
[ 8245.772641] RIP: 0010:[<ffffffffa051c8e6>] [<ffffffffa051c8e6>] btrfs_queue_work+0x2c/0x14d [btrfs]
[ 8245.772641] RSP: 0018:ffff88020d0478b8 EFLAGS: 00010202
[ 8245.772641] RAX: 0000000000000004 RBX: 6b6b6b6b6b6b6b6b RCX: ffffffffa0581488
[ 8245.772641] RDX: 0000000000000000 RSI: ffff880194b7bf48 RDI: ffff880144b6a7a0
[ 8245.772641] RBP: ffff88020d0478d8 R08: 0000000000000000 R09: 000000000000ffff
[ 8245.772641] R10: 0000000000000004 R11: 0000000000000005 R12: ffff880194b7bf48
[ 8245.772641] R13: ffff880194b7bf48 R14: 0000000000000410 R15: 0000000000000000
[ 8245.772641] FS: 00007f991e77d840(0000) GS:ffff88023e280000(0000) knlGS:0000000000000000
[ 8245.772641] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 8245.772641] CR2: 00007fbbd325ee68 CR3: 000000021de8e000 CR4: 00000000000006e0
[ 8245.772641] Stack:
[ 8245.772641] ffff880194b7bf00 ffff880202eb4000 ffff880194b7bf48 0000000000000410
[ 8245.772641] ffff88020d047958 ffffffffa04ec6d5 ffff8801629b2ee8 0000000082987570
[ 8245.772641] 0000000000a5813f 0000000000000001 ffff880013006100 0000000000000002
[ 8245.772641] Call Trace:
[ 8245.772641] [<ffffffffa04ec6d5>] btrfs_wq_submit_bio+0xe1/0x17b [btrfs]
[ 8245.772641] [<ffffffff81086bff>] ? check_irq_usage+0x76/0x87
[ 8245.772641] [<ffffffffa04ec825>] btree_submit_bio_hook+0xb6/0xd9 [btrfs]
[ 8245.772641] [<ffffffffa04ebb7c>] ? btree_csum_one_bio+0xad/0xad [btrfs]
[ 8245.772641] [<ffffffffa04eb1a6>] ? btree_io_failed_hook+0x5e/0x5e [btrfs]
[ 8245.772641] [<ffffffffa050a6e7>] submit_one_bio+0x8c/0xc7 [btrfs]
[ 8245.772641] [<ffffffffa050d75b>] submit_extent_page.isra.18+0x9d/0x186 [btrfs]
[ 8245.772641] [<ffffffffa050d95b>] write_one_eb+0x117/0x1ae [btrfs]
[ 8245.772641] [<ffffffffa050a79b>] ? end_extent_buffer_writeback+0x21/0x21 [btrfs]
[ 8245.772641] [<ffffffffa0510510>] btree_write_cache_pages+0x2ab/0x385 [btrfs]
[ 8245.772641] [<ffffffffa04eb2b8>] btree_writepages+0x23/0x5c [btrfs]
[ 8245.772641] [<ffffffff8111c661>] do_writepages+0x23/0x2c
[ 8245.772641] [<ffffffff81189cd4>] __writeback_single_inode+0xda/0x5bd
[ 8245.772641] [<ffffffff8118aa60>] ? writeback_single_inode+0x2b/0x173
[ 8245.772641] [<ffffffff8118aafd>] writeback_single_inode+0xc8/0x173
[ 8245.772641] [<ffffffff8118ac95>] write_inode_now+0x8a/0x95
[ 8245.772641] [<ffffffff81247bf0>] ? _atomic_dec_and_lock+0x30/0x4e
[ 8245.772641] [<ffffffff8117cc5e>] iput+0x17d/0x26a
[ 8245.772641] [<ffffffffa04ef355>] close_ctree+0x22a/0x325 [btrfs]
[ 8245.772641] [<ffffffff8117d26e>] ? evict_inodes+0xdc/0xeb
[ 8245.772641] [<ffffffffa04cb3ad>] btrfs_put_super+0x19/0x1b [btrfs]
[ 8245.772641] [<ffffffff81167607>] generic_shutdown_super+0x73/0xef
[ 8245.772641] [<ffffffff81167a3a>] kill_anon_super+0x13/0x1e
[ 8245.772641] [<ffffffffa04cb1b6>] btrfs_kill_super+0x17/0x23 [btrfs]
[ 8245.772641] [<ffffffff81167544>] deactivate_locked_super+0x3b/0x68
[ 8245.772641] [<ffffffff81167dd6>] deactivate_super+0x3f/0x43
[ 8245.772641] [<ffffffff8117fbb9>] cleanup_mnt+0x59/0x78
[ 8245.772641] [<ffffffff8117fc18>] __cleanup_mnt+0x12/0x14
[ 8245.772641] [<ffffffff81065371>] task_work_run+0x8f/0xbc
[ 8245.772641] [<ffffffff810028fb>] do_notify_resume+0x45/0x53
[ 8245.772641] [<ffffffff814651ac>] int_signal+0x12/0x17
[ 8245.772641] Code: 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53 49 89 f4 48 8b 46 70 a8 04 74 09 48 8b 5f 08 48 85 db 75 03 48 8b 1f 49 89 5c 24 68 <83> 7b 5c ff 74 04 f0 ff 43 50 49 83 7c 24 08 00 74 2c 4c 8d 6b
[ 8245.772641] RIP [<ffffffffa051c8e6>] btrfs_queue_work+0x2c/0x14d [btrfs]
[ 8245.772641] RSP <ffff88020d0478b8>
[ 8245.845040] ---[ end trace a01d038397e99b93 ]---
For logical reasons such as the phase of the moon, this happened more
often with "-o inode_cache" than without any mount options.
After some debugging it turned out to be simple to understand what was
happening:
1) close_ctree() is called;
2) It then stops the transaction kthread, which commits the current
transaction;
3) It asks the cleaner kthread to stop, which is currently running
btrfs_delete_unused_bgs();
4) btrfs_delete_unused_bgs() finds an unused block group, starts a new
transaction, deletes the block group, which implies COWing some
tree nodes and leafs and dirtying their respective pages, and then
finally it ends the transaction it started, without committing it;
5) The cleaner kthread stops;
6) close_ctree() releases (from memory) the block group objects, which
produces the warning in the trace pasted above;
7) Then it invalidates all pages of the btree inode, by calling
invalidate_inode_pages2(), which waits for any pages under writeback,
and releases any non-dirty pages;
8) All work queues are destroyed (waiting first for their current tasks
to finish execution);
9) A final iput() is called against the btree inode;
10) This iput triggers a writeback of the btree inode because it still
has dirty pages;
11) This starts the whole chain of callbacks for the btree inode until
it eventually reaches btrfs_wq_submit_bio() where it leads to a
NULL pointer dereference because the work queues were already
destroyed.
Fix this by making the cleaner commit any transaction that it started
after the transaction kthread was stopped.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-13 06:55:31 +01:00
}
2008-06-25 16:01:31 -04:00
}
static int transaction_kthread ( void * arg )
{
struct btrfs_root * root = arg ;
2016-06-22 18:54:23 -04:00
struct btrfs_fs_info * fs_info = root - > fs_info ;
2008-06-25 16:01:31 -04:00
struct btrfs_trans_handle * trans ;
struct btrfs_transaction * cur ;
2010-05-16 10:49:58 -04:00
u64 transid ;
2020-10-08 15:24:29 +03:00
time64_t delta ;
2008-06-25 16:01:31 -04:00
unsigned long delay ;
2012-03-12 16:05:50 +01:00
bool cannot_commit ;
2008-06-25 16:01:31 -04:00
do {
2012-03-12 16:05:50 +01:00
cannot_commit = false ;
2020-10-08 15:24:27 +03:00
delay = msecs_to_jiffies ( fs_info - > commit_interval * 1000 ) ;
2016-06-22 18:54:23 -04:00
mutex_lock ( & fs_info - > transaction_kthread_mutex ) ;
2008-06-25 16:01:31 -04:00
2016-06-22 18:54:23 -04:00
spin_lock ( & fs_info - > trans_lock ) ;
cur = fs_info - > running_transaction ;
2008-06-25 16:01:31 -04:00
if ( ! cur ) {
2016-06-22 18:54:23 -04:00
spin_unlock ( & fs_info - > trans_lock ) ;
2008-06-25 16:01:31 -04:00
goto sleep ;
}
2008-07-28 15:32:19 -04:00
2020-10-08 15:24:29 +03:00
delta = ktime_get_seconds ( ) - cur - > start_time ;
2021-11-05 16:45:28 -04:00
if ( ! test_and_clear_bit ( BTRFS_FS_COMMIT_TRANS , & fs_info - > flags ) & &
2023-08-24 16:59:22 -04:00
cur - > state < TRANS_STATE_COMMIT_PREP & &
2020-10-08 15:24:29 +03:00
delta < fs_info - > commit_interval ) {
2016-06-22 18:54:23 -04:00
spin_unlock ( & fs_info - > trans_lock ) ;
2020-10-20 12:44:17 +03:00
delay - = msecs_to_jiffies ( ( delta - 1 ) * 1000 ) ;
delay = min ( delay ,
msecs_to_jiffies ( fs_info - > commit_interval * 1000 ) ) ;
2008-06-25 16:01:31 -04:00
goto sleep ;
}
2010-05-16 10:49:58 -04:00
transid = cur - > transid ;
2016-06-22 18:54:23 -04:00
spin_unlock ( & fs_info - > trans_lock ) ;
2009-03-13 10:10:06 -04:00
2012-03-12 16:03:00 +01:00
/* If the file system is aborted, this will always fail. */
Btrfs: fix orphan transaction on the freezed filesystem
With the following debug patch:
static int btrfs_freeze(struct super_block *sb)
{
+ struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+ struct btrfs_transaction *trans;
+
+ spin_lock(&fs_info->trans_lock);
+ trans = fs_info->running_transaction;
+ if (trans) {
+ printk("Transid %llu, use_count %d, num_writer %d\n",
+ trans->transid, atomic_read(&trans->use_count),
+ atomic_read(&trans->num_writers));
+ }
+ spin_unlock(&fs_info->trans_lock);
return 0;
}
I found there was an orphan transaction after the freeze operation was done.
This is because the transaction may not be committed when the transaction
handle ends, even though it is the last handle of the current transaction.
This design avoids committing the transaction frequently, but also introduces
the above problem.
So I add btrfs_attach_transaction(), which can catch the current transaction
and commit it. If there is no transaction, it returns ENOENT and does
nothing.
This function can also be used instead of btrfs_join_transaction_freeze()
because it doesn't increase the writer counter and doesn't start a new
transaction, so it can also fix the deadlock between sync and freeze.
Besides that, it is used instead of btrfs_join_transaction() in
transaction_kthread(), because if there is no transaction, the transaction
kthread doesn't need to do anything.
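The way a caller can tell "no transaction is running" apart from a real
failure can be sketched in userspace C. The ERR_PTR()/IS_ERR()/PTR_ERR()
helpers below are simplified stand-ins for the kernel's; the structs,
attach_transaction(), kthread_tick() and the -EROFS return for an aborted
filesystem are assumptions made for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static inline long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static inline bool IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical, heavily reduced types. */
struct trans_handle { unsigned long long transid; };
struct fs { struct trans_handle *running; bool aborted; };

/* Attach to the running transaction; never start a new one. */
static struct trans_handle *attach_transaction(struct fs *fs)
{
	assert(fs != NULL);
	if (fs->aborted)
		return ERR_PTR(-EROFS);		/* assumed error code */
	if (!fs->running)
		return ERR_PTR(-ENOENT);	/* nothing to commit */
	return fs->running;
}

enum tick_result { TICK_COMMIT, TICK_IDLE, TICK_CANNOT_COMMIT };

/* One pass of a commit thread: only -ENOENT counts as "nothing to do". */
static enum tick_result kthread_tick(struct fs *fs)
{
	struct trans_handle *trans = attach_transaction(fs);

	if (IS_ERR(trans)) {
		if (PTR_ERR(trans) != -ENOENT)
			return TICK_CANNOT_COMMIT;
		return TICK_IDLE;
	}
	return TICK_COMMIT;
}
```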
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-09-20 01:54:00 -06:00
trans = btrfs_attach_transaction ( root ) ;
2012-03-12 16:05:50 +01:00
if ( IS_ERR ( trans ) ) {
2012-09-20 01:54:00 -06:00
if ( PTR_ERR ( trans ) ! = - ENOENT )
cannot_commit = true ;
2012-03-12 16:03:00 +01:00
goto sleep ;
2012-03-12 16:05:50 +01:00
}
2010-05-16 10:49:58 -04:00
if ( transid = = trans - > transid ) {
2016-09-09 21:39:03 -04:00
btrfs_commit_transaction ( trans ) ;
2010-05-16 10:49:58 -04:00
} else {
2016-09-09 21:39:03 -04:00
btrfs_end_transaction ( trans ) ;
2010-05-16 10:49:58 -04:00
}
2008-06-25 16:01:31 -04:00
sleep :
2016-06-22 18:54:23 -04:00
wake_up_process ( fs_info - > cleaner_kthread ) ;
mutex_unlock ( & fs_info - > transaction_kthread_mutex ) ;
2008-06-25 16:01:31 -04:00
2021-10-05 16:35:25 -04:00
if ( BTRFS_FS_ERROR ( fs_info ) )
2016-06-22 18:54:24 -04:00
btrfs_cleanup_transaction ( fs_info ) ;
2016-03-15 11:28:59 +01:00
if ( ! kthread_should_stop ( ) & &
2016-06-22 18:54:23 -04:00
( ! btrfs_transaction_blocked ( fs_info ) | |
2016-03-15 11:28:59 +01:00
cannot_commit ) )
2018-01-23 14:46:53 +02:00
schedule_timeout_interruptible ( delay ) ;
2008-06-25 16:01:31 -04:00
} while ( ! kthread_should_stop ( ) ) ;
return 0 ;
}
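The sleep computation in transaction_kthread() (full interval minus
"delta - 1" seconds, clamped to the full interval) can be checked with a
small userspace model. This is a simplified sketch: it works in signed
milliseconds instead of jiffies, and next_sleep_ms() is a hypothetical name,
not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the "transaction is still young" branch: how long to sleep
 * when the running transaction is delta_s seconds old and the commit
 * interval is commit_interval_s seconds. Milliseconds stand in for jiffies.
 */
static int64_t next_sleep_ms(int64_t commit_interval_s, int64_t delta_s)
{
	const int64_t full = commit_interval_s * 1000;
	int64_t delay;

	assert(commit_interval_s > 0 && delta_s >= 0);

	/* Subtract the time already elapsed, keeping one second of slack. */
	delay = full - (delta_s - 1) * 1000;

	/* For delta_s == 0 the result would exceed the interval; clamp it. */
	return delay < full ? delay : full;
}
```

For example, with a 30 second interval and a transaction that is 10 seconds
old, the model yields 21000 ms; for a transaction that has just started it
yields the full 30000 ms, which is the case the min() clamp covers.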
2011-11-03 15:17:42 -04:00
/*
2019-10-15 18:42:17 +03:00
* This will find the highest generation in the array of root backups. The
* index of the newest backup is returned, or -EINVAL if we can't find
* anything.
2011-11-03 15:17:42 -04:00
*
* We check to make sure the array is valid by comparing the
* generation of the latest root in the array with the generation
* in the super block. If they don't match we pitch it.
*/
2019-10-15 18:42:17 +03:00
static int find_newest_super_backup(struct btrfs_fs_info *info)
2011-11-03 15:17:42 -04:00
{
2019-10-15 18:42:17 +03:00
const u64 newest_gen = btrfs_super_generation ( info - > super_copy ) ;
2011-11-03 15:17:42 -04:00
	u64 cur;
	struct btrfs_root_backup *root_backup;
	int i;
	for (i = 0; i < BTRFS_NUM_BACKUP_ROOTS; i++) {
		root_backup = info->super_copy->super_roots + i;
		cur = btrfs_backup_tree_root_gen(root_backup);
		if (cur == newest_gen)
2019-10-15 18:42:17 +03:00
return i ;
2011-11-03 15:17:42 -04:00
}
2019-10-15 18:42:17 +03:00
return - EINVAL ;
2011-11-03 15:17:42 -04:00
}
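The lookup above can be mirrored in a userspace sketch that operates on a
plain array of generation numbers. NUM_BACKUP_ROOTS and find_newest_backup()
are hypothetical stand-ins; the real function reads the generations out of
the super block's backup array.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BACKUP_ROOTS 4	/* stand-in for BTRFS_NUM_BACKUP_ROOTS */

/*
 * Return the index of the backup slot whose generation equals the
 * generation recorded in the super block, or -EINVAL if no slot matches
 * (in which case the backup array is treated as unusable).
 */
static int find_newest_backup(const uint64_t gens[NUM_BACKUP_ROOTS],
			      uint64_t super_gen)
{
	assert(gens != NULL);
	for (int i = 0; i < NUM_BACKUP_ROOTS; i++) {
		if (gens[i] == super_gen)
			return i;
	}
	return -EINVAL;
}
```

Note that the search is for an exact match with the super block generation,
not for the numerically largest entry; that is what makes a mismatch
detectable.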
/*
 * copy all the root pointers into the super backup array.
 * this will bump the backup pointer by one when it is
 * done
 */
static void backup_super_roots(struct btrfs_fs_info *info)
{
2019-10-15 18:42:24 +03:00
const int next_backup = info - > backup_root_index ;
2011-11-03 15:17:42 -04:00
	struct btrfs_root_backup *root_backup;
	root_backup = info->super_for_commit->super_roots + next_backup;
	/*
	 * make sure all of our padding and empty slots get zero filled
	 * regardless of which ones we use today
	 */
	memset(root_backup, 0, sizeof(*root_backup));
	info->backup_root_index = (next_backup + 1) % BTRFS_NUM_BACKUP_ROOTS;
	btrfs_set_backup_tree_root(root_backup, info->tree_root->node->start);
	btrfs_set_backup_tree_root_gen(root_backup,
			btrfs_header_generation(info->tree_root->node));
	btrfs_set_backup_tree_root_level(root_backup,
			btrfs_header_level(info->tree_root->node));
	btrfs_set_backup_chunk_root(root_backup, info->chunk_root->node->start);
	btrfs_set_backup_chunk_root_gen(root_backup,
			btrfs_header_generation(info->chunk_root->node));
	btrfs_set_backup_chunk_root_level(root_backup,
			btrfs_header_level(info->chunk_root->node));
2022-08-09 13:02:18 +08:00
if ( ! btrfs_fs_compat_ro ( info , BLOCK_GROUP_TREE ) ) {
2021-12-15 15:40:07 -05:00
struct btrfs_root * extent_root = btrfs_extent_root ( info , 0 ) ;
2021-12-15 15:40:08 -05:00
struct btrfs_root * csum_root = btrfs_csum_root ( info , 0 ) ;
2021-12-15 15:40:07 -05:00
		btrfs_set_backup_extent_root(root_backup,
					     extent_root->node->start);
		btrfs_set_backup_extent_root_gen(root_backup,
				btrfs_header_generation(extent_root->node));
		btrfs_set_backup_extent_root_level(root_backup,
				btrfs_header_level(extent_root->node));
2021-12-15 15:40:08 -05:00
		btrfs_set_backup_csum_root(root_backup, csum_root->node->start);
		btrfs_set_backup_csum_root_gen(root_backup,
				btrfs_header_generation(csum_root->node));
		btrfs_set_backup_csum_root_level(root_backup,
				btrfs_header_level(csum_root->node));
2021-12-15 15:40:07 -05:00
}
2011-11-03 15:17:42 -04:00
2011-11-06 18:50:56 -05:00
/*
* we might commit during log recovery, which happens before we set
* the fs_root. Make sure it is valid before we fill it in.
*/
if ( info - > fs_root & & info - > fs_root - > node ) {
btrfs_set_backup_fs_root ( root_backup ,
info - > fs_root - > node - > start ) ;
btrfs_set_backup_fs_root_gen ( root_backup ,
2011-11-03 15:17:42 -04:00
btrfs_header_generation ( info - > fs_root - > node ) ) ;
2011-11-06 18:50:56 -05:00
btrfs_set_backup_fs_root_level ( root_backup ,
2011-11-03 15:17:42 -04:00
btrfs_header_level ( info - > fs_root - > node ) ) ;
2011-11-06 18:50:56 -05:00
}
2011-11-03 15:17:42 -04:00
btrfs_set_backup_dev_root ( root_backup , info - > dev_root - > node - > start ) ;
btrfs_set_backup_dev_root_gen ( root_backup ,
btrfs_header_generation ( info - > dev_root - > node ) ) ;
btrfs_set_backup_dev_root_level ( root_backup ,
btrfs_header_level ( info - > dev_root - > node ) ) ;
btrfs_set_backup_total_bytes ( root_backup ,
btrfs_super_total_bytes ( info - > super_copy ) ) ;
btrfs_set_backup_bytes_used ( root_backup ,
btrfs_super_bytes_used ( info - > super_copy ) ) ;
btrfs_set_backup_num_devices ( root_backup ,
btrfs_super_num_devices ( info - > super_copy ) ) ;
/*
* if we don ' t copy this out to the super_copy , it won ' t get remembered
* for the next commit
*/
memcpy ( & info - > super_copy - > super_roots ,
& info - > super_for_commit - > super_roots ,
sizeof ( * root_backup ) * BTRFS_NUM_BACKUP_ROOTS ) ;
}
2019-10-15 18:42:19 +03:00
/*
2023-09-08 01:09:25 +02:00
* Reads a backup root based on the passed priority . Prio 0 is the newest , prio
* 1 / 2 / 3 are 2 nd newest / 3 rd newest / 4 th ( oldest ) backup roots
2019-10-15 18:42:19 +03:00
*
2023-09-08 01:09:25 +02:00
* @ fs_info : filesystem whose backup roots need to be read
* @ priority : priority of backup root required
2019-10-15 18:42:19 +03:00
*
* Returns backup root index on success and - EINVAL otherwise .
*/
static int read_backup_root ( struct btrfs_fs_info * fs_info , u8 priority )
{
int backup_index = find_newest_super_backup ( fs_info ) ;
struct btrfs_super_block * super = fs_info - > super_copy ;
struct btrfs_root_backup * root_backup ;
if ( priority < BTRFS_NUM_BACKUP_ROOTS & & backup_index > = 0 ) {
if ( priority = = 0 )
return backup_index ;
backup_index = backup_index + BTRFS_NUM_BACKUP_ROOTS - priority ;
backup_index % = BTRFS_NUM_BACKUP_ROOTS ;
} else {
return - EINVAL ;
}
root_backup = super - > super_roots + backup_index ;
btrfs_set_super_generation ( super ,
btrfs_backup_tree_root_gen ( root_backup ) ) ;
btrfs_set_super_root ( super , btrfs_backup_tree_root ( root_backup ) ) ;
btrfs_set_super_root_level ( super ,
btrfs_backup_tree_root_level ( root_backup ) ) ;
btrfs_set_super_bytes_used ( super , btrfs_backup_bytes_used ( root_backup ) ) ;
/*
* Fixme : the total bytes and num_devices need to match or we should
* need a fsck
*/
btrfs_set_super_total_bytes ( super , btrfs_backup_total_bytes ( root_backup ) ) ;
btrfs_set_super_num_devices ( super , btrfs_backup_num_devices ( root_backup ) ) ;
return backup_index ;
}
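The index arithmetic in read_backup_root() above, together with the rotation done when a backup slot is written, can be sketched in isolation. This is a minimal user-space illustration, not the kernel code: NUM_BACKUP_ROOTS is assumed to mirror BTRFS_NUM_BACKUP_ROOTS (four slots), and both helper names are hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Assumption: mirrors BTRFS_NUM_BACKUP_ROOTS, i.e. four backup slots. */
#define NUM_BACKUP_ROOTS 4

/*
 * Hypothetical helper: map a priority (0 = newest, 3 = oldest) to a slot
 * index, given the index of the newest slot. Returns -EINVAL when either
 * argument is out of range, as read_backup_root() does for a bad priority.
 */
int backup_slot_for_priority(int newest, unsigned int priority)
{
	if (newest < 0 || newest >= NUM_BACKUP_ROOTS ||
	    priority >= NUM_BACKUP_ROOTS)
		return -EINVAL;

	/* Adding NUM_BACKUP_ROOTS first keeps the dividend non-negative. */
	return (newest + NUM_BACKUP_ROOTS - (int)priority) % NUM_BACKUP_ROOTS;
}

/* Hypothetical helper: the slot that follows the one just written. */
int next_backup_slot(int written)
{
	assert(written >= 0 && written < NUM_BACKUP_ROOTS);
	return (written + 1) % NUM_BACKUP_ROOTS;
}
```

With the newest backup in slot 1, priorities 0 to 3 walk backwards through slots 1, 0, 3 and 2, wrapping around the end of the array.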
2013-03-17 02:10:31 +00:00
/* helper to cleanup workers */
static void btrfs_stop_all_workers ( struct btrfs_fs_info * fs_info )
{
2014-02-28 10:46:14 +08:00
btrfs_destroy_workqueue ( fs_info - > fixup_workers ) ;
2014-02-28 10:46:07 +08:00
btrfs_destroy_workqueue ( fs_info - > delalloc_workers ) ;
2014-02-28 10:46:06 +08:00
btrfs_destroy_workqueue ( fs_info - > workers ) ;
2022-05-26 09:36:40 +02:00
if ( fs_info - > endio_workers )
destroy_workqueue ( fs_info - > endio_workers ) ;
2022-04-18 06:43:11 +02:00
if ( fs_info - > rmw_workers )
destroy_workqueue ( fs_info - > rmw_workers ) ;
2022-05-26 09:36:38 +02:00
if ( fs_info - > compressed_write_workers )
destroy_workqueue ( fs_info - > compressed_write_workers ) ;
2014-02-28 10:46:10 +08:00
btrfs_destroy_workqueue ( fs_info - > endio_write_workers ) ;
btrfs_destroy_workqueue ( fs_info - > endio_freespace_worker ) ;
2014-02-28 10:46:15 +08:00
btrfs_destroy_workqueue ( fs_info - > delayed_workers ) ;
2014-02-28 10:46:12 +08:00
btrfs_destroy_workqueue ( fs_info - > caching_workers ) ;
2014-02-28 10:46:09 +08:00
btrfs_destroy_workqueue ( fs_info - > flush_workers ) ;
2014-02-28 10:46:16 +08:00
btrfs_destroy_workqueue ( fs_info - > qgroup_rescan_workers ) ;
2019-12-13 16:22:14 -08:00
if ( fs_info - > discard_ctl . discard_workers )
destroy_workqueue ( fs_info - > discard_ctl . discard_workers ) ;
2017-02-04 17:12:00 +00:00
/*
* Now that all other work queues are destroyed , we can safely destroy
* the queues used for metadata I / O , since tasks from those other work
* queues can do metadata I / O operations .
*/
2022-05-26 09:36:40 +02:00
if ( fs_info - > endio_meta_workers )
destroy_workqueue ( fs_info - > endio_meta_workers ) ;
2013-03-17 02:10:31 +00:00
}
2013-10-31 02:45:20 +05:30
static void free_root_extent_buffers ( struct btrfs_root * root )
{
if ( root ) {
free_extent_buffer ( root - > node ) ;
free_extent_buffer ( root - > commit_root ) ;
root - > node = NULL ;
root - > commit_root = NULL ;
}
}
2021-11-05 16:45:51 -04:00
static void free_global_root_pointers ( struct btrfs_fs_info * fs_info )
{
struct btrfs_root * root , * tmp ;
rbtree_postorder_for_each_entry_safe ( root , tmp ,
& fs_info - > global_root_tree ,
rb_node )
free_root_extent_buffers ( root ) ;
}
2011-11-03 15:17:42 -04:00
/* helper to cleanup tree roots */
2019-10-10 10:39:25 +08:00
static void free_root_pointers ( struct btrfs_fs_info * info , bool free_chunk_root )
2011-11-03 15:17:42 -04:00
{
2013-10-31 02:45:20 +05:30
free_root_extent_buffers ( info - > tree_root ) ;
2013-05-17 14:06:51 -04:00
2021-11-05 16:45:51 -04:00
free_global_root_pointers ( info ) ;
2013-10-31 02:45:20 +05:30
free_root_extent_buffers ( info - > dev_root ) ;
free_root_extent_buffers ( info - > quota_root ) ;
free_root_extent_buffers ( info - > uuid_root ) ;
2020-02-14 16:11:42 -05:00
free_root_extent_buffers ( info - > fs_root ) ;
2020-05-15 14:01:42 +08:00
free_root_extent_buffers ( info - > data_reloc_root ) ;
2021-12-15 15:40:07 -05:00
free_root_extent_buffers ( info - > block_group_root ) ;
2023-09-14 09:06:57 -07:00
free_root_extent_buffers ( info - > stripe_root ) ;
2019-10-10 10:39:25 +08:00
if ( free_chunk_root )
2013-10-31 02:45:20 +05:30
free_root_extent_buffers ( info - > chunk_root ) ;
2011-11-03 15:17:42 -04:00
}
2020-02-14 16:11:42 -05:00
void btrfs_put_root ( struct btrfs_root * root )
{
if ( ! root )
return ;
if ( refcount_dec_and_test ( & root - > refs ) ) {
btrfs: use an xarray to track open inodes in a root
Currently we use a red black tree (rb-tree) to track the currently open
inodes of a root (in struct btrfs_root::inode_tree). This however is not
very efficient when the number of inodes is large since rb-trees are
binary trees. For example for 100K open inodes, the tree has a depth of
17. Besides that, inserting into the tree requires navigating through it
and pulling useless cache lines in the process since the red black tree
nodes are embedded within the btrfs inode - on the other hand, by being
embedded, it requires no extra memory allocations.
We can improve this by using an xarray instead, which is efficient when
indices are densely clustered (such as inode numbers), is more cache
friendly and behaves like a resizable array, with a much better search
and insertion complexity than a red black tree. This only has one small
disadvantage which is that insertion will sometimes require allocating
memory for the xarray - which may fail (not that often since it uses a
kmem_cache) - but on the other hand we can reduce the btrfs inode
structure size by 24 bytes (from 1080 down to 1056 bytes) after removing
the embedded red black tree node, which after the next patches will allow
to reduce the size of the structure to 1024 bytes, meaning we will be able
to store 4 inodes per 4K page instead of 3 inodes.
This change does a straightforward change to use an xarray, and results
in a transaction abort if we can't allocate memory for the xarray when
creating an inode - but the next patch changes things so that we don't
need to abort.
Running the following fs_mark test showed some improvements:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
MOUNT_OPTIONS="-o ssd"
FILES=100000
THREADS=$(nproc --all)
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $DEV
mount $MOUNT_OPTIONS $DEV $MNT
OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
Before this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 92081.6 12505547
16 2400000 0 138222.6 13067072
23 3600000 0 148833.1 13290336
43 4800000 0 97864.7 13931248
53 6000000 0 85597.3 14384313
After this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 93225.1 12571078
16 2400000 0 146720.3 12805007
23 3600000 0 160626.4 13073835
46 4800000 0 116286.2 13802927
53 6000000 0 90087.9 14754892
The test was run with a release kernel config (Debian's default config).
Also capturing the insertion times into the rb tree and into the xarray,
that is measuring the duration of the old function inode_tree_add() and
the duration of the new btrfs_add_inode_to_root() function, gave the
following results (in nanoseconds):
Before this patch, inode_tree_add() execution times:
Count: 5000000
Range: 0.000 - 5536887.000; Mean: 775.674; Median: 729.000; Stddev: 4820.961
Percentiles: 90th: 1015.000; 95th: 1139.000; 99th: 1397.000
0.000 - 7.816: 40 |
7.816 - 37.858: 209 |
37.858 - 170.278: 6059 |
170.278 - 753.961: 2754890 #####################################################
753.961 - 3326.728: 2232312 ###########################################
3326.728 - 14667.018: 4366 |
14667.018 - 64652.943: 852 |
64652.943 - 284981.761: 550 |
284981.761 - 1256150.914: 221 |
1256150.914 - 5536887.000: 7 |
After this patch, btrfs_add_inode_to_root() execution times:
Count: 5000000
Range: 0.000 - 2900652.000; Mean: 272.148; Median: 241.000; Stddev: 2873.369
Percentiles: 90th: 342.000; 95th: 432.000; 99th: 572.000
0.000 - 7.264: 104 |
7.264 - 33.145: 352 |
33.145 - 140.081: 109606 #
140.081 - 581.930: 4840090 #####################################################
581.930 - 2407.590: 43532 |
2407.590 - 9950.979: 2245 |
9950.979 - 41119.278: 514 |
41119.278 - 169902.616: 155 |
169902.616 - 702018.539: 47 |
702018.539 - 2900652.000: 9 |
Average, percentiles, standard deviation, etc, are all much better.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-24 16:58:01 +01:00
if ( WARN_ON ( ! xa_empty ( & root - > inodes ) ) )
xa_destroy ( & root - > inodes ) ;
2020-05-20 14:58:51 +08:00
WARN_ON ( test_bit ( BTRFS_ROOT_DEAD_RELOC_TREE , & root - > state ) ) ;
2020-02-14 16:11:42 -05:00
if ( root - > anon_dev )
free_anon_bdev ( root - > anon_dev ) ;
2020-06-23 17:40:07 +09:00
free_root_extent_buffers ( root ) ;
2020-02-14 16:11:42 -05:00
# ifdef CONFIG_BTRFS_DEBUG
2022-07-15 13:59:21 +02:00
spin_lock ( & root - > fs_info - > fs_roots_radix_lock ) ;
2020-02-14 16:11:42 -05:00
list_del_init ( & root - > leak_list ) ;
2022-07-15 13:59:21 +02:00
spin_unlock ( & root - > fs_info - > fs_roots_radix_lock ) ;
2020-02-14 16:11:42 -05:00
# endif
kfree ( root ) ;
}
}
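The commit message quoted inside btrfs_put_root() above makes two numeric claims: a binary tree holding 100K inodes is about 17 levels deep, and shrinking the inode structure to 1024 bytes lets four inodes fit in a 4K page instead of three. A small sketch that reproduces the arithmetic, assuming a perfectly balanced binary tree (a real red-black tree can be deeper) and using hypothetical helper names:

```c
#include <assert.h>

/*
 * Smallest d such that 2^d >= n, i.e. ceil(log2(n)) for n >= 1.
 * This is the depth of a perfectly balanced binary tree; it is only a
 * lower bound for a red-black tree.
 */
unsigned int min_binary_depth(unsigned long n)
{
	unsigned int depth = 0;
	unsigned long capacity = 1;

	while (capacity < n) {
		capacity <<= 1;
		depth++;
	}
	return depth;
}

/* Number of whole objects of obj_size bytes that fit in one page. */
unsigned long objs_per_page(unsigned long page_size, unsigned long obj_size)
{
	assert(obj_size != 0);
	return page_size / obj_size;
}
```

For 100000 entries the first function returns 17, since 2^16 = 65536 is too small and 2^17 = 131072 is enough. For a 4096-byte page the second returns 3 for both 1080 and 1056 bytes, and 4 for 1024 bytes.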
2014-05-07 17:06:09 -04:00
void btrfs_free_fs_roots ( struct btrfs_fs_info * fs_info )
2013-04-24 16:35:41 -04:00
{
2022-07-15 13:59:21 +02:00
int ret ;
struct btrfs_root * gang [ 8 ] ;
int i ;
2013-04-24 16:35:41 -04:00
while ( ! list_empty ( & fs_info - > dead_roots ) ) {
2022-07-15 13:59:21 +02:00
gang [ 0 ] = list_entry ( fs_info - > dead_roots . next ,
struct btrfs_root , root_list ) ;
list_del ( & gang [ 0 ] - > root_list ) ;
2013-04-24 16:35:41 -04:00
2022-07-15 13:59:21 +02:00
if ( test_bit ( BTRFS_ROOT_IN_RADIX , & gang [ 0 ] - > state ) )
btrfs_drop_and_free_fs_root ( fs_info , gang [ 0 ] ) ;
btrfs_put_root ( gang [ 0 ] ) ;
2013-04-24 16:35:41 -04:00
}
2022-07-15 13:59:21 +02:00
while ( 1 ) {
ret = radix_tree_gang_lookup ( & fs_info - > fs_roots_radix ,
( void * * ) gang , 0 ,
ARRAY_SIZE ( gang ) ) ;
if ( ! ret )
break ;
for ( i = 0 ; i < ret ; i + + )
btrfs_drop_and_free_fs_root ( fs_info , gang [ i ] ) ;
2013-04-24 16:35:41 -04:00
}
}
2011-11-03 15:17:42 -04:00
2014-08-01 18:12:38 -05:00
static void btrfs_init_scrub ( struct btrfs_fs_info * fs_info )
{
mutex_init ( & fs_info - > scrub_lock ) ;
atomic_set ( & fs_info - > scrubs_running , 0 ) ;
atomic_set ( & fs_info - > scrub_pause_req , 0 ) ;
atomic_set ( & fs_info - > scrubs_paused , 0 ) ;
atomic_set ( & fs_info - > scrub_cancel_req , 0 ) ;
init_waitqueue_head ( & fs_info - > scrub_pause_wait ) ;
2019-01-30 14:45:02 +08:00
refcount_set ( & fs_info - > scrub_workers_refcnt , 0 ) ;
2014-08-01 18:12:38 -05:00
}
2014-08-01 18:12:39 -05:00
static void btrfs_init_balance ( struct btrfs_fs_info * fs_info )
{
spin_lock_init ( & fs_info - > balance_lock ) ;
mutex_init ( & fs_info - > balance_mutex ) ;
atomic_set ( & fs_info - > balance_pause_req , 0 ) ;
atomic_set ( & fs_info - > balance_cancel_req , 0 ) ;
fs_info - > balance_ctl = NULL ;
init_waitqueue_head ( & fs_info - > balance_wait_q ) ;
2021-05-18 00:37:36 +02:00
atomic_set ( & fs_info - > reloc_cancel_req , 0 ) ;
2014-08-01 18:12:39 -05:00
}
2023-02-19 19:10:22 +01:00
static int btrfs_init_btree_inode ( struct super_block * sb )
2014-08-01 18:12:40 -05:00
{
2023-02-19 19:10:22 +01:00
struct btrfs_fs_info * fs_info = btrfs_sb ( sb ) ;
2022-09-14 19:04:49 -04:00
unsigned long hash = btrfs_inode_hash ( BTRFS_BTREE_INODE_OBJECTID ,
fs_info - > tree_root ) ;
2023-02-19 19:10:22 +01:00
struct inode * inode ;
inode = new_inode ( sb ) ;
if ( ! inode )
return - ENOMEM ;
2016-06-22 18:54:24 -04:00
2024-05-05 13:47:02 +01:00
btrfs_set_inode_number ( BTRFS_I ( inode ) , BTRFS_BTREE_INODE_OBJECTID ) ;
2016-06-22 18:54:24 -04:00
set_nlink ( inode , 1 ) ;
2014-08-01 18:12:40 -05:00
/*
* we set the i_size on the btree inode to the max possible int .
* the real end of the address space is determined by all of
* the devices in the system
*/
2016-06-22 18:54:24 -04:00
inode - > i_size = OFFSET_MAX ;
inode - > i_mapping - > a_ops = & btree_aops ;
2023-02-19 19:10:22 +01:00
mapping_set_gfp_mask ( inode - > i_mapping , GFP_NOFS ) ;
2014-08-01 18:12:40 -05:00
2019-03-01 10:47:59 +08:00
extent_io_tree_init ( fs_info , & BTRFS_I ( inode ) - > io_tree ,
2022-10-28 02:47:06 +02:00
IO_TREE_BTREE_INODE_IO ) ;
2016-06-22 18:54:24 -04:00
extent_map_tree_init ( & BTRFS_I ( inode ) - > extent_tree ) ;
2014-08-01 18:12:40 -05:00
2020-02-14 16:11:43 -05:00
BTRFS_I ( inode ) - > root = btrfs_grab_root ( fs_info - > tree_root ) ;
2016-06-22 18:54:24 -04:00
set_bit ( BTRFS_INODE_DUMMY , & BTRFS_I ( inode ) - > runtime_flags ) ;
2022-09-14 19:04:49 -04:00
__insert_inode_hash ( inode , hash ) ;
2023-02-19 19:10:22 +01:00
fs_info - > btree_inode = inode ;
return 0 ;
2014-08-01 18:12:40 -05:00
}
2014-08-01 18:12:41 -05:00
static void btrfs_init_dev_replace_locks ( struct btrfs_fs_info * fs_info )
{
mutex_init ( & fs_info - > dev_replace . lock_finishing_cancel_unmount ) ;
2018-04-05 01:29:24 +02:00
init_rwsem ( & fs_info - > dev_replace . rwsem ) ;
2018-04-05 01:04:49 +02:00
init_waitqueue_head ( & fs_info - > dev_replace . replace_wait ) ;
2014-08-01 18:12:41 -05:00
}
2014-08-01 18:12:42 -05:00
static void btrfs_init_qgroup ( struct btrfs_fs_info * fs_info )
{
spin_lock_init ( & fs_info - > qgroup_lock ) ;
mutex_init ( & fs_info - > qgroup_ioctl_lock ) ;
fs_info - > qgroup_tree = RB_ROOT ;
INIT_LIST_HEAD ( & fs_info - > dirty_qgroups ) ;
fs_info - > qgroup_seq = 1 ;
fs_info - > qgroup_ulist = NULL ;
2016-08-15 12:10:33 -04:00
fs_info - > qgroup_rescan_running = false ;
btrfs: skip subtree scan if it's too high to avoid low stall in btrfs_commit_transaction()
Btrfs qgroup has a long history of bringing performance penalty in
btrfs_commit_transaction().
Although we tried our best to mitigate such impact, there is still an
unsolved call site, btrfs_drop_snapshot().
This function will find the highest shared tree block and modify its
extent ownership to do a subvolume/snapshot dropping.
Such change will affect the whole subtree, and cause tons of qgroup
dirty extents and stall btrfs_commit_transaction().
To avoid such problem, here we introduce a new sysfs interface,
/sys/fs/btrfs/<uuid>/qgroups/drop_subtree_threshold, to determine
whether and at which level we should skip qgroup accounting for subtree
dropping.
The default value is BTRFS_MAX_LEVEL, thus every subtree drop will go
through qgroup accounting, to ensure qgroup numbers are kept as
consistent as possible.
For performance sensitive cases, the value can be changed to something
more reasonable like 3, so that dropping any subtree at or above level 3
marks qgroup inconsistent and skips the accounting.
The cost is obvious, the qgroup number is no longer consistent, but at
least performance is more reasonable, and users have the control.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-24 09:14:09 +08:00
fs_info - > qgroup_drop_subtree_thres = BTRFS_MAX_LEVEL ;
2014-08-01 18:12:42 -05:00
mutex_init ( & fs_info - > qgroup_rescan_lock ) ;
}
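The threshold rule described in the commit message quoted inside btrfs_init_qgroup() above can be sketched as a single predicate. This is an illustration of the rule as the message states it, not the kernel's actual check: MAX_LEVEL is assumed to mirror BTRFS_MAX_LEVEL (8, so valid tree levels are 0 to 7), and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: mirrors BTRFS_MAX_LEVEL; valid tree levels are 0..7. */
#define MAX_LEVEL 8

/*
 * Hypothetical predicate: true when dropping a subtree rooted at
 * subtree_level should skip qgroup accounting (and mark qgroup
 * inconsistent), i.e. when the level is at or above the threshold.
 */
bool skip_subtree_accounting(unsigned int subtree_level, unsigned int threshold)
{
	assert(subtree_level < MAX_LEVEL);
	return subtree_level >= threshold;
}
```

With the default threshold of MAX_LEVEL no valid level reaches it, so every subtree drop is accounted. With a threshold of 3, levels 3 and above are skipped.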
2021-11-10 14:42:17 +08:00
static int btrfs_init_workqueues ( struct btrfs_fs_info * fs_info )
2015-02-16 16:29:26 +01:00
{
2018-02-13 17:50:42 +08:00
u32 max_active = fs_info - > thread_pool_size ;
2015-02-16 18:34:01 +01:00
unsigned int flags = WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_UNBOUND ;
btrfs: use alloc_ordered_workqueue() to create ordered workqueues
BACKGROUND
==========
When multiple work items are queued to a workqueue, their execution order
doesn't match the queueing order. They may get executed in any order and
simultaneously. When fully serialized execution - one by one in the queueing
order - is needed, an ordered workqueue should be used which can be created
with alloc_ordered_workqueue().
However, alloc_ordered_workqueue() was a later addition. Before it, an
ordered workqueue could be obtained by creating an UNBOUND workqueue with
@max_active==1. This originally was an implementation side-effect which was
broken by 4c16bd327c74 ("workqueue: restore WQ_UNBOUND/max_active==1 to be
ordered"). Because there were users that depended on the ordered execution,
5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
made workqueue allocation path to implicitly promote UNBOUND workqueues w/
@max_active==1 to ordered workqueues.
While this has worked okay, overloading the UNBOUND allocation interface
this way creates other issues. It's difficult to tell whether a given
workqueue actually needs to be ordered and users that legitimately want a
min concurrency level wq unexpectedly get an ordered one instead. With
planned UNBOUND workqueue updates to improve execution locality and more
prevalence of chiplet designs which can benefit from such improvements, this
isn't a state we wanna be in forever.
This patch series audits all call sites that create an UNBOUND workqueue w/
@max_active==1 and converts them to alloc_ordered_workqueue() as necessary.
BTRFS
=====
* fs_info->scrub_workers initialized in scrub_workers_get() was setting
@max_active to 1 when @is_dev_replace is set and it seems that the
workqueue actually needs to be ordered if @is_dev_replace. Update the code
so that alloc_ordered_workqueue() is used if @is_dev_replace.
* fs_info->discard_ctl.discard_workers initialized in
btrfs_init_workqueues() was directly using alloc_workqueue() w/
@max_active==1. Converted to alloc_ordered_workqueue().
* fs_info->fixup_workers and fs_info->qgroup_rescan_workers initialized in
btrfs_queue_work() use the btrfs's workqueue wrapper, btrfs_workqueue,
which are allocated with btrfs_alloc_workqueue().
btrfs_workqueue implements automatic @max_active adjustment which is
disabled when the specified max limit is below a certain threshold, so
calling btrfs_alloc_workqueue() with @limit_active==1 yields an ordered
workqueue whose @max_active won't be changed as the auto-tuning is
disabled.
This is rather brittle in that nothing clearly indicates that the two
workqueues should be ordered or btrfs_alloc_workqueue() must disable
auto-tuning when @limit_active==1.
This patch factors out the common btrfs_workqueue init code into
btrfs_init_workqueue() and adds an explicit btrfs_alloc_ordered_workqueue().
The two workqueues are converted to use the new ordered allocation
interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-25 13:33:08 -10:00
unsigned int ordered_flags = WQ_MEM_RECLAIM | WQ_FREEZABLE ;
2015-02-16 16:29:26 +01:00
fs_info - > workers =
2022-04-18 06:43:09 +02:00
btrfs_alloc_workqueue ( fs_info , " worker " , flags , max_active , 16 ) ;
2015-02-16 16:29:26 +01:00
fs_info - > delalloc_workers =
2016-06-09 16:22:11 -04:00
btrfs_alloc_workqueue ( fs_info , " delalloc " ,
flags , max_active , 2 ) ;
2015-02-16 16:29:26 +01:00
fs_info - > flush_workers =
2016-06-09 16:22:11 -04:00
btrfs_alloc_workqueue ( fs_info , " flush_delalloc " ,
flags , max_active , 0 ) ;
2015-02-16 16:29:26 +01:00
fs_info - > caching_workers =
2016-06-09 16:22:11 -04:00
btrfs_alloc_workqueue ( fs_info , " cache " , flags , max_active , 0 ) ;
2015-02-16 16:29:26 +01:00
fs_info - > fixup_workers =
2023-05-25 13:33:08 -10:00
btrfs_alloc_ordered_workqueue ( fs_info , " fixup " , ordered_flags ) ;
2015-02-16 16:29:26 +01:00
fs_info - > endio_workers =
2022-05-26 09:36:40 +02:00
alloc_workqueue ( " btrfs-endio " , flags , max_active ) ;
2015-02-16 16:29:26 +01:00
fs_info - > endio_meta_workers =
2022-05-26 09:36:40 +02:00
alloc_workqueue ( " btrfs-endio-meta " , flags , max_active ) ;
2022-04-18 06:43:11 +02:00
fs_info - > rmw_workers = alloc_workqueue ( " btrfs-rmw " , flags , max_active ) ;
2015-02-16 16:29:26 +01:00
fs_info - > endio_write_workers =
2016-06-09 16:22:11 -04:00
btrfs_alloc_workqueue ( fs_info , " endio-write " , flags ,
max_active , 2 ) ;
2022-05-26 09:36:38 +02:00
fs_info - > compressed_write_workers =
alloc_workqueue ( " btrfs-compressed-write " , flags , max_active ) ;
2015-02-16 16:29:26 +01:00
fs_info - > endio_freespace_worker =
2016-06-09 16:22:11 -04:00
btrfs_alloc_workqueue ( fs_info , " freespace-write " , flags ,
max_active , 0 ) ;
2015-02-16 16:29:26 +01:00
fs_info - > delayed_workers =
2016-06-09 16:22:11 -04:00
btrfs_alloc_workqueue ( fs_info , " delayed-meta " , flags ,
max_active , 0 ) ;
2015-02-16 16:29:26 +01:00
fs_info - > qgroup_rescan_workers =
2023-05-25 13:33:08 -10:00
btrfs_alloc_ordered_workqueue ( fs_info , " qgroup-rescan " ,
ordered_flags ) ;
2019-12-13 16:22:14 -08:00
fs_info - > discard_ctl . discard_workers =
2023-05-25 13:33:08 -10:00
alloc_ordered_workqueue ( " btrfs_discard " , WQ_FREEZABLE ) ;
2015-02-16 16:29:26 +01:00
2023-05-03 09:06:15 +02:00
if ( ! ( fs_info - > workers & &
2022-04-18 06:43:09 +02:00
fs_info - > delalloc_workers & & fs_info - > flush_workers & &
2015-02-16 16:29:26 +01:00
fs_info - > endio_workers & & fs_info - > endio_meta_workers & &
2022-05-26 09:36:38 +02:00
fs_info - > compressed_write_workers & &
2022-11-01 19:16:12 +08:00
fs_info - > endio_write_workers & &
2015-02-16 16:29:26 +01:00
fs_info - > endio_freespace_worker & & fs_info - > rmw_workers & &
btrfs: remove reada infrastructure
Currently there is only one user for btrfs metadata readahead, and
that's scrub.
But even for the single user, it's not providing the correct
functionality it needs, as scrub needs reada for commit root, which
current readahead can't provide (although it would be pretty easy to add such a
feature).
Despite this, there are some extra problems related to metadata
readahead:
- Duplicated feature with btrfs_path::reada
- Partly duplicated feature of btrfs_fs_info::buffer_radix
Btrfs already caches its metadata in buffer_radix, while readahead
tries to read the tree block no matter if it's already cached.
- Poor layer separation
Metadata readahead works kinda at device level.
This is definitely not the correct layer it should be, since metadata
is at btrfs logical address space, it should not bother device at all.
This gives bugs extra chances to sneak in and adds
unnecessary complexity.
- Dead code
In the very beginning of scrub.c we have #undef DEBUG, rendering all
the debug related code useless and unable to test.
Thus here I propose to remove the metadata readahead mechanism
completely.
[BENCHMARK]
There is a full benchmark for the scrub performance difference using the
old btrfs_reada_add() and btrfs_path::reada.
For the worst case (no dirty metadata, slow HDD), there could be a 5%
performance drop for scrub.
For other cases (even SATA SSD), there is no distinguishable performance
difference.
The number is reported scrub speed, in MiB/s.
The resolution is limited by the reported duration, which only has a
resolution of 1 second.
        Old         New         Diff
SSD     455.3       466.332     +2.42%
HDD     103.927     98.012      -5.69%
Comprehensive test methodology is in the cover letter of the patch.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-14 21:01:45 +08:00
fs_info - > caching_workers & & fs_info - > fixup_workers & &
fs_info - > delayed_workers & & fs_info - > qgroup_rescan_workers & &
2019-12-13 16:22:14 -08:00
fs_info - > discard_ctl . discard_workers ) ) {
2015-02-16 16:29:26 +01:00
return - ENOMEM ;
}
return 0 ;
}
2019-06-03 16:58:56 +02:00
static int btrfs_init_csum_hash ( struct btrfs_fs_info * fs_info , u16 csum_type )
{
struct crypto_shash * csum_shash ;
2019-10-08 18:41:33 +02:00
const char * csum_driver = btrfs_super_csum_driver ( csum_type ) ;
2019-06-03 16:58:56 +02:00
2019-10-08 18:41:33 +02:00
csum_shash = crypto_alloc_shash ( csum_driver , 0 , 0 ) ;
2019-06-03 16:58:56 +02:00
if ( IS_ERR ( csum_shash ) ) {
btrfs_err ( fs_info , " error allocating %s hash for checksum " ,
2019-10-08 18:41:33 +02:00
csum_driver ) ;
2019-06-03 16:58:56 +02:00
return PTR_ERR ( csum_shash ) ;
}
fs_info - > csum_shash = csum_shash ;
2023-03-29 09:13:05 +09:00
/*
* Check if the checksum implementation is a fast accelerated one .
* As - is this is a bit of a hack and should be replaced once the csum
* implementations provide that information themselves .
*/
switch ( csum_type ) {
case BTRFS_CSUM_TYPE_CRC32 :
if ( ! strstr ( crypto_shash_driver_name ( csum_shash ) , " generic " ) )
set_bit ( BTRFS_FS_CSUM_IMPL_FAST , & fs_info - > flags ) ;
break ;
btrfs: add xxhash to fast checksum implementations
The implementation of XXHASH is now CPU only but still fast enough to be
considered for the synchronous checksumming, like non-generic crc32c.
A userspace benchmark comparing it to various implementations (patched
hash-speedtest from btrfs-progs):
Block size: 4096
Iterations: 1000000
Implementation: builtin
Units: CPU cycles
NULL-NOP: cycles: 73384294, cycles/i 73
NULL-MEMCPY: cycles: 228033868, cycles/i 228, 61664.320 MiB/s
CRC32C-ref: cycles: 24758559416, cycles/i 24758, 567.950 MiB/s
CRC32C-NI: cycles: 1194350470, cycles/i 1194, 11773.433 MiB/s
CRC32C-ADLERSW: cycles: 6150186216, cycles/i 6150, 2286.372 MiB/s
CRC32C-ADLERHW: cycles: 626979180, cycles/i 626, 22427.453 MiB/s
CRC32C-PCL: cycles: 466746732, cycles/i 466, 30126.699 MiB/s
XXHASH: cycles: 860656400, cycles/i 860, 16338.188 MiB/s
The comparison covers a purely software implementation (ref), the current,
outdated accelerated one using the crc32q instruction (NI), optimized
implementations by M. Adler
(https://stackoverflow.com/questions/17645167/implementing-sse-4-2s-crc32c-in-software/17646775#17646775),
and the best one, taken from the kernel, which uses the PCLMULQDQ
instruction (PCL).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-04 00:06:02 +02:00
case BTRFS_CSUM_TYPE_XXHASH :
set_bit ( BTRFS_FS_CSUM_IMPL_FAST , & fs_info - > flags ) ;
break ;
2023-03-29 09:13:05 +09:00
default :
break ;
}
2022-06-22 20:45:18 +02:00
btrfs_info ( fs_info , " using %s (%s) checksum algorithm " ,
btrfs_super_csum_name ( csum_type ) ,
crypto_shash_driver_name ( csum_shash ) ) ;
2019-06-03 16:58:56 +02:00
return 0 ;
}
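The "fast implementation" decision in btrfs_init_csum_hash() can be sketched in userspace as below. This is a minimal sketch, not the kernel code: the enum values and the function name csum_impl_is_fast() are hypothetical stand-ins, and the driver names used when calling it are illustrative only. It keeps the same rule as the switch above: crc32c counts as fast unless the selected driver name contains "generic", xxhash always counts as fast, and every other type does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the BTRFS_CSUM_TYPE_* constants. */
enum csum_type { CSUM_CRC32, CSUM_XXHASH, CSUM_SHA256, CSUM_BLAKE2 };

/*
 * Return true when the checksum implementation should be treated as fast.
 * crc32c is fast only if the driver name does not contain "generic";
 * xxhash is always treated as fast; other types never are.
 */
static bool csum_impl_is_fast(enum csum_type type, const char *driver_name)
{
	assert(driver_name != NULL);

	switch (type) {
	case CSUM_CRC32:
		return strstr(driver_name, "generic") == NULL;
	case CSUM_XXHASH:
		return true;
	default:
		return false;
	}
}
```

As the comment in the function above admits, matching on the driver name is a bit of a hack, so the sketch inherits the same limitation.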
2014-08-01 18:12:46 -05:00
static int btrfs_replay_log ( struct btrfs_fs_info * fs_info ,
struct btrfs_fs_devices * fs_devices )
{
int ret ;
2022-09-14 13:32:50 +08:00
struct btrfs_tree_parent_check check = { 0 } ;
2014-08-01 18:12:46 -05:00
struct btrfs_root * log_tree_root ;
struct btrfs_super_block * disk_super = fs_info - > super_copy ;
u64 bytenr = btrfs_super_log_root ( disk_super ) ;
2018-03-29 09:08:11 +08:00
int level = btrfs_super_log_root_level ( disk_super ) ;
2014-08-01 18:12:46 -05:00
if ( fs_devices - > rw_devices = = 0 ) {
2015-10-08 11:37:06 +02:00
btrfs_warn ( fs_info , " log replay required on RO media " ) ;
2014-08-01 18:12:46 -05:00
return - EIO ;
}
2020-01-24 09:32:18 -05:00
log_tree_root = btrfs_alloc_root ( fs_info , BTRFS_TREE_LOG_OBJECTID ,
GFP_KERNEL ) ;
2014-08-01 18:12:46 -05:00
if ( ! log_tree_root )
return - ENOMEM ;
2022-09-14 13:32:50 +08:00
check . level = level ;
check . transid = fs_info - > generation + 1 ;
check . owner_root = BTRFS_TREE_LOG_OBJECTID ;
log_tree_root - > node = read_tree_block ( fs_info , bytenr , & check ) ;
2015-05-25 17:30:15 +08:00
if ( IS_ERR ( log_tree_root - > node ) ) {
2015-10-08 11:37:06 +02:00
btrfs_warn ( fs_info , " failed to read log tree " ) ;
2015-06-11 14:16:44 +08:00
ret = PTR_ERR ( log_tree_root - > node ) ;
2020-02-14 16:11:42 -05:00
log_tree_root - > node = NULL ;
2020-01-24 09:33:01 -05:00
btrfs_put_root ( log_tree_root ) ;
2015-06-11 14:16:44 +08:00
return ret ;
2022-02-22 15:41:19 +08:00
}
if ( ! extent_buffer_uptodate ( log_tree_root - > node ) ) {
2015-10-08 11:37:06 +02:00
btrfs_err ( fs_info , " failed to read log tree " ) ;
2020-01-24 09:33:01 -05:00
btrfs_put_root ( log_tree_root ) ;
2014-08-01 18:12:46 -05:00
return - EIO ;
}
2022-02-22 15:41:19 +08:00
2014-08-01 18:12:46 -05:00
/* returns with log_tree_root freed on success */
ret = btrfs_recover_log_trees ( log_tree_root ) ;
if ( ret ) {
2016-06-22 18:54:23 -04:00
btrfs_handle_fs_error ( fs_info , ret ,
" Failed to recover log tree " ) ;
2020-01-24 09:33:01 -05:00
btrfs_put_root ( log_tree_root ) ;
2014-08-01 18:12:46 -05:00
return ret ;
}
2017-07-17 08:45:34 +01:00
if ( sb_rdonly ( fs_info - > sb ) ) {
2016-06-21 21:16:51 -04:00
ret = btrfs_commit_super ( fs_info ) ;
2014-08-01 18:12:46 -05:00
if ( ret )
return ret ;
}
return 0 ;
}
2021-11-05 16:45:51 -04:00
static int load_global_roots_objectid ( struct btrfs_root * tree_root ,
struct btrfs_path * path , u64 objectid ,
const char * name )
{
struct btrfs_fs_info * fs_info = tree_root - > fs_info ;
struct btrfs_root * root ;
2021-12-15 15:40:08 -05:00
u64 max_global_id = 0 ;
2021-11-05 16:45:51 -04:00
int ret ;
struct btrfs_key key = {
. objectid = objectid ,
. type = BTRFS_ROOT_ITEM_KEY ,
. offset = 0 ,
} ;
bool found = false ;
/* If we have IGNOREDATACSUMS skip loading these roots. */
if ( objectid = = BTRFS_CSUM_TREE_OBJECTID & &
btrfs_test_opt ( fs_info , IGNOREDATACSUMS ) ) {
2024-06-14 13:52:30 +09:30
set_bit ( BTRFS_FS_STATE_NO_DATA_CSUMS , & fs_info - > fs_state ) ;
2021-11-05 16:45:51 -04:00
return 0 ;
}
while ( 1 ) {
ret = btrfs_search_slot ( NULL , tree_root , & key , path , 0 , 0 ) ;
if ( ret < 0 )
break ;
if ( path - > slots [ 0 ] > = btrfs_header_nritems ( path - > nodes [ 0 ] ) ) {
ret = btrfs_next_leaf ( tree_root , path ) ;
if ( ret ) {
if ( ret > 0 )
ret = 0 ;
break ;
}
}
ret = 0 ;
btrfs_item_key_to_cpu ( path - > nodes [ 0 ] , & key , path - > slots [ 0 ] ) ;
if ( key . objectid ! = objectid )
break ;
btrfs_release_path ( path ) ;
2021-12-15 15:40:08 -05:00
/*
* Just worry about this for extent tree , it ' ll be the same for
* everybody .
*/
if ( objectid = = BTRFS_EXTENT_TREE_OBJECTID )
max_global_id = max ( max_global_id , key . offset ) ;
2021-11-05 16:45:51 -04:00
found = true ;
root = read_tree_root_path ( tree_root , path , & key ) ;
if ( IS_ERR ( root ) ) {
if ( ! btrfs_test_opt ( fs_info , IGNOREBADROOTS ) )
ret = PTR_ERR ( root ) ;
break ;
}
set_bit ( BTRFS_ROOT_TRACK_DIRTY , & root - > state ) ;
ret = btrfs_global_root_insert ( root ) ;
if ( ret ) {
btrfs_put_root ( root ) ;
break ;
}
key . offset + + ;
}
btrfs_release_path ( path ) ;
2021-12-15 15:40:08 -05:00
if ( objectid = = BTRFS_EXTENT_TREE_OBJECTID )
fs_info - > nr_global_roots = max_global_id + 1 ;
2021-11-05 16:45:51 -04:00
if ( ! found | | ret ) {
if ( objectid = = BTRFS_CSUM_TREE_OBJECTID )
2024-06-14 13:52:30 +09:30
set_bit ( BTRFS_FS_STATE_NO_DATA_CSUMS , & fs_info - > fs_state ) ;
2021-11-05 16:45:51 -04:00
if ( ! btrfs_test_opt ( fs_info , IGNOREBADROOTS ) )
ret = ret ? ret : - ENOENT ;
else
ret = 0 ;
btrfs_err ( fs_info , " failed to load root %s " , name ) ;
}
return ret ;
}
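The nr_global_roots bookkeeping in load_global_roots_objectid() boils down to "largest key offset seen, plus one". A minimal userspace sketch of that arithmetic follows; count_global_roots() and its array argument are hypothetical and stand in for the key offsets found while walking the root tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Given the key offsets of all extent-tree root items that were found,
 * return the value that would be stored in nr_global_roots.
 * max_global_id starts at 0, so the result is 1 even for an empty list.
 */
static uint64_t count_global_roots(const uint64_t *offsets, size_t n)
{
	uint64_t max_global_id = 0;

	assert(offsets != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (offsets[i] > max_global_id)
			max_global_id = offsets[i];
	}
	return max_global_id + 1;
}
```

Note that the count is derived from the highest offset rather than from the number of items, so a gap in the offsets still yields the larger value.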
static int load_global_roots ( struct btrfs_root * tree_root )
{
struct btrfs_path * path ;
int ret = 0 ;
path = btrfs_alloc_path ( ) ;
if ( ! path )
return - ENOMEM ;
ret = load_global_roots_objectid ( tree_root , path ,
BTRFS_EXTENT_TREE_OBJECTID , " extent " ) ;
if ( ret )
goto out ;
ret = load_global_roots_objectid ( tree_root , path ,
BTRFS_CSUM_TREE_OBJECTID , " csum " ) ;
if ( ret )
goto out ;
if ( ! btrfs_fs_compat_ro ( tree_root - > fs_info , FREE_SPACE_TREE ) )
goto out ;
ret = load_global_roots_objectid ( tree_root , path ,
BTRFS_FREE_SPACE_TREE_OBJECTID ,
" free space " ) ;
out :
btrfs_free_path ( path ) ;
return ret ;
}
2016-06-21 21:16:51 -04:00
static int btrfs_read_roots ( struct btrfs_fs_info * fs_info )
2014-08-01 18:12:45 -05:00
{
2016-06-21 21:16:51 -04:00
struct btrfs_root * tree_root = fs_info - > tree_root ;
2015-02-16 18:44:34 +01:00
struct btrfs_root * root ;
2014-08-01 18:12:45 -05:00
struct btrfs_key location ;
int ret ;
2024-01-24 01:09:46 +01:00
ASSERT ( fs_info - > tree_root ) ;
2016-06-21 21:16:51 -04:00
2021-11-05 16:45:51 -04:00
ret = load_global_roots ( tree_root ) ;
if ( ret )
return ret ;
2014-08-01 18:12:45 -05:00
location . type = BTRFS_ROOT_ITEM_KEY ;
location . offset = 0 ;
2022-08-09 13:02:18 +08:00
if ( btrfs_fs_compat_ro ( fs_info , BLOCK_GROUP_TREE ) ) {
2022-08-09 13:02:17 +08:00
location . objectid = BTRFS_BLOCK_GROUP_TREE_OBJECTID ;
root = btrfs_read_tree_root ( tree_root , & location ) ;
if ( IS_ERR ( root ) ) {
if ( ! btrfs_test_opt ( fs_info , IGNOREBADROOTS ) ) {
ret = PTR_ERR ( root ) ;
goto out ;
}
} else {
set_bit ( BTRFS_ROOT_TRACK_DIRTY , & root - > state ) ;
fs_info - > block_group_root = root ;
}
}
location . objectid = BTRFS_DEV_TREE_OBJECTID ;
2015-02-16 18:44:34 +01:00
root = btrfs_read_tree_root ( tree_root , & location ) ;
2018-03-29 06:11:45 +08:00
if ( IS_ERR ( root ) ) {
2020-10-16 11:29:18 -04:00
if ( ! btrfs_test_opt ( fs_info , IGNOREBADROOTS ) ) {
ret = PTR_ERR ( root ) ;
goto out ;
}
} else {
set_bit ( BTRFS_ROOT_TRACK_DIRTY , & root - > state ) ;
fs_info - > dev_root = root ;
2018-03-29 06:11:45 +08:00
}
2021-03-11 11:23:14 -05:00
/* Initialize fs_info for all devices in any case */
2022-11-04 07:12:34 -07:00
ret = btrfs_init_devices_late ( fs_info ) ;
if ( ret )
goto out ;
2014-08-01 18:12:45 -05:00
2020-05-15 14:01:42 +08:00
/*
* This tree can share blocks with some other fs tree during relocation
* and we need a proper setup by btrfs_get_fs_root
*/
2020-05-15 19:35:55 +02:00
root = btrfs_get_fs_root ( tree_root - > fs_info ,
BTRFS_DATA_RELOC_TREE_OBJECTID , true ) ;
2020-05-15 14:01:42 +08:00
if ( IS_ERR ( root ) ) {
2020-10-16 11:29:18 -04:00
if ( ! btrfs_test_opt ( fs_info , IGNOREBADROOTS ) ) {
ret = PTR_ERR ( root ) ;
goto out ;
}
} else {
set_bit ( BTRFS_ROOT_TRACK_DIRTY , & root - > state ) ;
fs_info - > data_reloc_root = root ;
2020-05-15 14:01:42 +08:00
}
2014-08-01 18:12:45 -05:00
location . objectid = BTRFS_QUOTA_TREE_OBJECTID ;
2015-02-16 18:44:34 +01:00
root = btrfs_read_tree_root ( tree_root , & location ) ;
if ( ! IS_ERR ( root ) ) {
set_bit ( BTRFS_ROOT_TRACK_DIRTY , & root - > state ) ;
fs_info - > quota_root = root ;
2014-08-01 18:12:45 -05:00
}
location . objectid = BTRFS_UUID_TREE_OBJECTID ;
2015-02-16 18:44:34 +01:00
root = btrfs_read_tree_root ( tree_root , & location ) ;
if ( IS_ERR ( root ) ) {
2020-10-16 11:29:18 -04:00
if ( ! btrfs_test_opt ( fs_info , IGNOREBADROOTS ) ) {
ret = PTR_ERR ( root ) ;
if ( ret ! = - ENOENT )
goto out ;
}
2014-08-01 18:12:45 -05:00
} else {
2015-02-16 18:44:34 +01:00
set_bit ( BTRFS_ROOT_TRACK_DIRTY , & root - > state ) ;
fs_info - > uuid_root = root ;
2014-08-01 18:12:45 -05:00
}
2023-09-14 09:06:57 -07:00
if ( btrfs_fs_incompat ( fs_info , RAID_STRIPE_TREE ) ) {
location . objectid = BTRFS_RAID_STRIPE_TREE_OBJECTID ;
root = btrfs_read_tree_root ( tree_root , & location ) ;
if ( IS_ERR ( root ) ) {
if ( ! btrfs_test_opt ( fs_info , IGNOREBADROOTS ) ) {
ret = PTR_ERR ( root ) ;
goto out ;
}
} else {
set_bit ( BTRFS_ROOT_TRACK_DIRTY , & root - > state ) ;
fs_info - > stripe_root = root ;
}
}
2014-08-01 18:12:45 -05:00
return 0 ;
2018-03-29 06:11:45 +08:00
out :
btrfs_warn ( fs_info , " failed to read root (objectid=%llu): %d " ,
location . objectid , ret ) ;
return ret ;
2014-08-01 18:12:45 -05:00
}
2018-05-11 13:35:26 +08:00
/*
* Real super block validation
* NOTE : super csum type and incompat features will not be checked here .
*
* @ sb : super block to check
* @ mirror_num : the super block number to check its bytenr :
* 0 the primary ( 1 st ) sb
* 1 , 2 2 nd and 3 rd backup copy
* - 1 skip bytenr check
*/
2024-05-30 19:14:12 +02:00
int btrfs_validate_super ( const struct btrfs_fs_info * fs_info ,
const struct btrfs_super_block * sb , int mirror_num )
2018-05-11 13:35:25 +08:00
{
u64 nodesize = btrfs_super_nodesize ( sb ) ;
u64 sectorsize = btrfs_super_sectorsize ( sb ) ;
int ret = 0 ;
2024-06-14 13:52:31 +09:30
const bool ignore_flags = btrfs_test_opt ( fs_info , IGNORESUPERFLAGS ) ;
2018-05-11 13:35:25 +08:00
if ( btrfs_super_magic ( sb ) ! = BTRFS_MAGIC ) {
btrfs_err ( fs_info , " no valid FS found " ) ;
ret = - EINVAL ;
}
2024-06-14 13:52:31 +09:30
if ( ( btrfs_super_flags ( sb ) & ~ BTRFS_SUPER_FLAG_SUPP ) ) {
if ( ! ignore_flags ) {
btrfs_err ( fs_info ,
" unrecognized or unsupported super flag 0x%llx " ,
btrfs_super_flags ( sb ) & ~ BTRFS_SUPER_FLAG_SUPP ) ;
ret = - EINVAL ;
} else {
btrfs_info ( fs_info ,
" unrecognized or unsupported super flags: 0x%llx, ignored " ,
btrfs_super_flags ( sb ) & ~ BTRFS_SUPER_FLAG_SUPP ) ;
}
2018-05-11 13:35:25 +08:00
}
if ( btrfs_super_root_level ( sb ) > = BTRFS_MAX_LEVEL ) {
btrfs_err ( fs_info , " tree_root level too big: %d >= %d " ,
btrfs_super_root_level ( sb ) , BTRFS_MAX_LEVEL ) ;
ret = - EINVAL ;
}
if ( btrfs_super_chunk_root_level ( sb ) > = BTRFS_MAX_LEVEL ) {
btrfs_err ( fs_info , " chunk_root level too big: %d >= %d " ,
btrfs_super_chunk_root_level ( sb ) , BTRFS_MAX_LEVEL ) ;
ret = - EINVAL ;
}
if ( btrfs_super_log_root_level ( sb ) > = BTRFS_MAX_LEVEL ) {
btrfs_err ( fs_info , " log_root level too big: %d >= %d " ,
btrfs_super_log_root_level ( sb ) , BTRFS_MAX_LEVEL ) ;
ret = - EINVAL ;
}
/*
* Check sectorsize and nodesize first , other check will need it .
* Check all possible sectorsize ( 4 K , 8 K , 16 K , 32 K , 64 K ) here .
*/
if ( ! is_power_of_2 ( sectorsize ) | | sectorsize < 4096 | |
sectorsize > BTRFS_MAX_METADATA_BLOCKSIZE ) {
btrfs_err ( fs_info , " invalid sectorsize %llu " , sectorsize ) ;
ret = - EINVAL ;
}
2021-01-26 16:34:02 +08:00
/*
2022-01-13 13:22:10 +08:00
* We only support at most two sectorsizes : 4 K and PAGE_SIZE .
*
* We can support 16 K sectorsize with 64 K page size without problem ,
* but such sectorsize / pagesize combination doesn ' t make much sense .
* 4 K will be our future standard , PAGE_SIZE is supported from the very
* beginning .
2021-01-26 16:34:02 +08:00
*/
2022-01-13 13:22:10 +08:00
if ( sectorsize > PAGE_SIZE | | ( sectorsize ! = SZ_4K & & sectorsize ! = PAGE_SIZE ) ) {
2018-05-11 13:35:25 +08:00
btrfs_err ( fs_info ,
2021-01-26 16:34:02 +08:00
" sectorsize %llu not yet supported for page size %lu " ,
2018-05-11 13:35:25 +08:00
sectorsize , PAGE_SIZE ) ;
ret = - EINVAL ;
}
2021-01-26 16:34:02 +08:00
2018-05-11 13:35:25 +08:00
if ( ! is_power_of_2 ( nodesize ) | | nodesize < sectorsize | |
nodesize > BTRFS_MAX_METADATA_BLOCKSIZE ) {
btrfs_err ( fs_info , " invalid nodesize %llu " , nodesize ) ;
ret = - EINVAL ;
}
if ( nodesize ! = le32_to_cpu ( sb - > __unused_leafsize ) ) {
btrfs_err ( fs_info , " invalid leafsize %u, should be %llu " ,
le32_to_cpu ( sb - > __unused_leafsize ) , nodesize ) ;
ret = - EINVAL ;
}
/* Root alignment check */
if ( ! IS_ALIGNED ( btrfs_super_root ( sb ) , sectorsize ) ) {
btrfs_warn ( fs_info , " tree_root block unaligned: %llu " ,
btrfs_super_root ( sb ) ) ;
ret = - EINVAL ;
}
if ( ! IS_ALIGNED ( btrfs_super_chunk_root ( sb ) , sectorsize ) ) {
btrfs_warn ( fs_info , " chunk_root block unaligned: %llu " ,
btrfs_super_chunk_root ( sb ) ) ;
ret = - EINVAL ;
}
if ( ! IS_ALIGNED ( btrfs_super_log_root ( sb ) , sectorsize ) ) {
btrfs_warn ( fs_info , " log_root block unaligned: %llu " ,
btrfs_super_log_root ( sb ) ) ;
ret = - EINVAL ;
}
2023-09-28 09:09:47 +08:00
if ( ! fs_info - > fs_devices - > temp_fsid & &
memcmp ( fs_info - > fs_devices - > fsid , sb - > fsid , BTRFS_FSID_SIZE ) ! = 0 ) {
2021-05-31 12:26:01 +03:00
btrfs_err ( fs_info ,
" superblock fsid doesn't match fsid of fs_devices: %pU != %pU " ,
btrfs: use the correct superblock to compare fsid in btrfs_validate_super
The function btrfs_validate_super() should verify the fsid in the provided
superblock argument, because all its callers expect it to do that,
as in the following call stacks:
write_all_supers()
sb = fs_info->super_for_commit;
btrfs_validate_write_super(.., sb)
btrfs_validate_super(.., sb, ..)
scrub_one_super()
btrfs_validate_super(.., sb, ..)
And
check_dev_super()
btrfs_validate_super(.., sb, ..)
However, it currently verifies the fs_info::super_copy::fsid instead,
which is not correct. Fix this using the correct fsid in the superblock
argument.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-31 19:16:34 +08:00
sb - > fsid , fs_info - > fs_devices - > fsid ) ;
2021-05-31 12:26:01 +03:00
ret = - EINVAL ;
}
2023-07-31 19:16:35 +08:00
if ( memcmp ( fs_info - > fs_devices - > metadata_uuid , btrfs_sb_fsid_ptr ( sb ) ,
BTRFS_FSID_SIZE ) ! = 0 ) {
2021-05-31 12:26:01 +03:00
btrfs_err ( fs_info ,
" superblock metadata_uuid doesn't match metadata uuid of fs_devices: %pU != %pU " ,
2023-07-31 19:16:35 +08:00
btrfs_sb_fsid_ptr ( sb ) , fs_info - > fs_devices - > metadata_uuid ) ;
2021-05-31 12:26:01 +03:00
ret = - EINVAL ;
}
2023-05-24 20:02:42 +08:00
if ( memcmp ( fs_info - > fs_devices - > metadata_uuid , sb - > dev_item . fsid ,
BTRFS_FSID_SIZE ) ! = 0 ) {
btrfs_err ( fs_info ,
" dev_item UUID does not match metadata fsid: %pU != %pU " ,
fs_info - > fs_devices - > metadata_uuid , sb - > dev_item . fsid ) ;
ret = - EINVAL ;
}
2022-08-09 13:02:18 +08:00
/*
* Artificial requirement for block - group - tree to force newer features
* ( free - space - tree , no - holes ) so the test matrix is smaller .
*/
if ( btrfs_fs_compat_ro ( fs_info , BLOCK_GROUP_TREE ) & &
( ! btrfs_fs_compat_ro ( fs_info , FREE_SPACE_TREE_VALID ) | |
! btrfs_fs_incompat ( fs_info , NO_HOLES ) ) ) {
btrfs_err ( fs_info ,
2024-06-20 16:04:51 +01:00
" block-group-tree feature requires free-space-tree and no-holes " ) ;
2022-08-09 13:02:18 +08:00
ret = - EINVAL ;
}
2018-05-11 13:35:25 +08:00
/*
* Hint to catch really bogus numbers , bitflips or so , more exact checks are
* done later
*/
if ( btrfs_super_bytes_used ( sb ) < 6 * btrfs_super_nodesize ( sb ) ) {
btrfs_err ( fs_info , " bytes_used is too small %llu " ,
btrfs_super_bytes_used ( sb ) ) ;
ret = - EINVAL ;
}
if ( ! is_power_of_2 ( btrfs_super_stripesize ( sb ) ) ) {
btrfs_err ( fs_info , " invalid stripesize %u " ,
btrfs_super_stripesize ( sb ) ) ;
ret = - EINVAL ;
}
if ( btrfs_super_num_devices ( sb ) > ( 1UL < < 31 ) )
btrfs_warn ( fs_info , " suspicious number of devices: %llu " ,
btrfs_super_num_devices ( sb ) ) ;
if ( btrfs_super_num_devices ( sb ) = = 0 ) {
btrfs_err ( fs_info , " number of devices is 0 " ) ;
ret = - EINVAL ;
}
2018-05-11 13:35:26 +08:00
if ( mirror_num > = 0 & &
btrfs_super_bytenr ( sb ) ! = btrfs_sb_offset ( mirror_num ) ) {
2018-05-11 13:35:25 +08:00
btrfs_err ( fs_info , " super offset mismatch %llu != %llu " ,
btrfs_super_bytenr ( sb ) , btrfs_sb_offset ( mirror_num ) ) ;
ret = - EINVAL ;
}
/*
* Obvious sys_chunk_array corruptions , it must hold at least one key
* and one chunk
*/
if ( btrfs_super_sys_array_size ( sb ) > BTRFS_SYSTEM_CHUNK_ARRAY_SIZE ) {
btrfs_err ( fs_info , " system chunk array too big %u > %u " ,
btrfs_super_sys_array_size ( sb ) ,
BTRFS_SYSTEM_CHUNK_ARRAY_SIZE ) ;
ret = - EINVAL ;
}
if ( btrfs_super_sys_array_size ( sb ) < sizeof ( struct btrfs_disk_key )
+ sizeof ( struct btrfs_chunk ) ) {
btrfs_err ( fs_info , " system chunk array too small %u < %zu " ,
btrfs_super_sys_array_size ( sb ) ,
sizeof ( struct btrfs_disk_key )
+ sizeof ( struct btrfs_chunk ) ) ;
ret = - EINVAL ;
}
/*
* The generation is a global counter , we ' ll trust it more than the others
* but it ' s still possible that it ' s the one that ' s wrong .
*/
if ( btrfs_super_generation ( sb ) < btrfs_super_chunk_root_generation ( sb ) )
btrfs_warn ( fs_info ,
" suspicious: generation < chunk_root_generation: %llu < %llu " ,
btrfs_super_generation ( sb ) ,
btrfs_super_chunk_root_generation ( sb ) ) ;
if ( btrfs_super_generation ( sb ) < btrfs_super_cache_generation ( sb )
& & btrfs_super_cache_generation ( sb ) ! = ( u64 ) - 1 )
btrfs_warn ( fs_info ,
" suspicious: generation < cache_generation: %llu < %llu " ,
btrfs_super_generation ( sb ) ,
btrfs_super_cache_generation ( sb ) ) ;
return ret ;
}
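The sectorsize and nodesize rules checked in btrfs_validate_super() can be condensed into one predicate. The sketch below is a userspace approximation under stated assumptions: sizes_are_valid() and is_pow2() are hypothetical names, MAX_METADATA_BLOCKSIZE is assumed to equal 64K (matching the 4K to 64K range in the comment above), and the page size is passed in as a parameter instead of coming from PAGE_SIZE. Unlike the kernel function, which keeps going and reports every failed check, the sketch returns at the first failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match BTRFS_MAX_METADATA_BLOCKSIZE (64K). */
#define MAX_METADATA_BLOCKSIZE 65536u

static bool is_pow2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* Return true when the sectorsize/nodesize pair passes the checks above. */
static bool sizes_are_valid(uint64_t sectorsize, uint64_t nodesize,
			    uint64_t page_size)
{
	assert(is_pow2(page_size));

	/* sectorsize: power of two in [4K, 64K] */
	if (!is_pow2(sectorsize) || sectorsize < 4096 ||
	    sectorsize > MAX_METADATA_BLOCKSIZE)
		return false;

	/* only 4K and the page size itself are supported */
	if (sectorsize > page_size ||
	    (sectorsize != 4096 && sectorsize != page_size))
		return false;

	/* nodesize: power of two in [sectorsize, 64K] */
	if (!is_pow2(nodesize) || nodesize < sectorsize ||
	    nodesize > MAX_METADATA_BLOCKSIZE)
		return false;

	return true;
}
```

For example, a 16K sectorsize on a 64K page system is rejected by the second check even though it passes the first, which is the case the comment in the function calls out.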
2018-05-11 13:35:26 +08:00
/*
* Validation of super block at mount time .
* Some checks already done early at mount time , like csum type and incompat
* flags will be skipped .
*/
static int btrfs_validate_mount_super ( struct btrfs_fs_info * fs_info )
{
btrfs: check superblock to ensure the fs was not modified at thaw time
[BACKGROUND]
There is an incident report in which a user hibernated the system with
a btrfs filesystem on a removable device still mounted.
Then, by accident, that btrfs got mounted and modified by another
system/OS before being returned to the hibernated system.
After resuming from hibernation, new writes went to the victim btrfs.
Now the fs is completely broken, since the underlying btrfs is no longer
the same one as before the hibernation, and the user lost their data due to
various transid mismatches.
[REPRODUCER]
We can emulate the situation using the following small script:
truncate -s 1G $dev
mkfs.btrfs -f $dev
mount $dev $mnt
fsstress -w -d $mnt -n 500
sync
xfs_freeze -f $mnt
cp $dev $dev.backup
# There is no way to mount the same cloned fs on the same system,
# as the conflicting fsid will be rejected by btrfs.
# Thus here we have to wipe the fs using a different btrfs.
mkfs.btrfs -f $dev.backup
dd if=$dev.backup of=$dev bs=1M
xfs_freeze -u $mnt
fsstress -w -d $mnt -n 20
umount $mnt
btrfs check $dev
The final fsck will fail because some tree blocks have an incorrect fsid.
This is enough to emulate the problem hit by the unfortunate user.
[ENHANCEMENT]
Although such a case should not be that common, it can still happen from
time to time.
From the btrfs point of view, we can detect any unexpected super block change,
and if there is one, we just mark the fs read-only
and thaw the fs.
This limits the damage to a minimum, and I hope no one will lose
their data this way anymore.
Suggested-by: Goffredo Baroncelli <kreijack@libero.it>
Link: https://lore.kernel.org/linux-btrfs/83bf3b4b-7f4c-387a-b286-9251e3991e34@bluemole.com/
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-24 20:16:22 +08:00
return btrfs_validate_super ( fs_info , fs_info - > super_copy , 0 ) ;
2018-05-11 13:35:26 +08:00
}
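The mirror_num argument selects which on-disk superblock location the bytenr field is compared against: 0 for the primary copy, 1 and 2 for the backups, and -1 to skip the comparison. The sketch below illustrates that check in userspace. It assumes the usual btrfs layout (primary copy at 64K, backups at 16K shifted left by 12 bits per mirror); the constants and the names sb_offset() and bytenr_matches() are stand-ins for illustration, not the kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout constants; stand-ins for the kernel's definitions. */
#define SUPER_INFO_OFFSET  65536ULL  /* primary superblock at 64K */
#define SUPER_MIRROR_MAX   3
#define SUPER_MIRROR_SHIFT 12

/* Byte offset of superblock copy 'mirror' (0 = primary). */
static uint64_t sb_offset(int mirror)
{
	assert(mirror >= 0 && mirror < SUPER_MIRROR_MAX);

	if (mirror == 0)
		return SUPER_INFO_OFFSET;
	return 16384ULL << (SUPER_MIRROR_SHIFT * mirror);
}

/* Return 1 if bytenr is acceptable for mirror_num; -1 skips the check. */
static int bytenr_matches(uint64_t bytenr, int mirror_num)
{
	if (mirror_num < 0)
		return 1;
	return bytenr == sb_offset(mirror_num);
}
```

Under these assumptions the three copies sit at 64K, 64M and 256G. The mount-time wrapper above passes 0 and so insists on the primary location, while the write-time validation passes -1 because bytenr is overwritten shortly afterwards.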
btrfs: Do super block verification before writing it to disk
There are already 2 reports about strangely corrupted super blocks,
where csum still matches but extra garbage gets slipped into super block.
The corruption would look like:
------
superblock: bytenr=65536, device=/dev/sdc1
---------------------------------------------------------
csum_type 41700 (INVALID)
csum 0x3b252d3a [match]
bytenr 65536
flags 0x1
( WRITTEN )
magic _BHRfS_M [match]
...
incompat_flags 0x5b22400000000169
( MIXED_BACKREF |
COMPRESS_LZO |
BIG_METADATA |
EXTENDED_IREF |
SKINNY_METADATA |
unknown flag: 0x5b22400000000000 )
...
------
Or
------
superblock: bytenr=65536, device=/dev/mapper/x
---------------------------------------------------------
csum_type 35355 (INVALID)
csum_size 32
csum 0xf0dbeddd [match]
bytenr 65536
flags 0x1
( WRITTEN )
magic _BHRfS_M [match]
...
incompat_flags 0x176d200000000169
( MIXED_BACKREF |
COMPRESS_LZO |
BIG_METADATA |
EXTENDED_IREF |
SKINNY_METADATA |
unknown flag: 0x176d200000000000 )
------
Obviously, csum_type and incompat_flags get some garbage, but its csum
still matches, which means kernel calculates the csum based on corrupted
super block memory.
And after manually fixing these values, the filesystem is completely
healthy without any problem exposed by btrfs check.
Although the cause is still unknown, at least detect it and prevent further
corruption.
Both reports have the same symptoms: a 4-byte overwrite at offset 192 of
the superblock. The superblock structure is not allocated or
freed and stays in memory for the whole filesystem lifetime, so it's
not a use-after-free kind of error on someone else's leaked page.
A vague pointer to the probable cause is the mention of other system
freezes related to graphics card drivers.
Reported-by: Ken Swenson <flat@imo.uto.moe>
Reported-by: Ben Parsons <9parsonsb@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add brief analysis of the reports ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-11 13:35:27 +08:00
/*
* Validation of super block at write time .
* Some checks like bytenr check will be skipped as their values will be
* overwritten soon .
* Extra checks like csum type and incompat flags will be done here .
*/
static int btrfs_validate_write_super ( struct btrfs_fs_info * fs_info ,
struct btrfs_super_block * sb )
{
int ret ;
2022-08-24 20:16:22 +08:00
ret = btrfs_validate_super ( fs_info , sb , - 1 ) ;
2018-05-11 13:35:27 +08:00
if ( ret < 0 )
goto out ;
2019-06-03 16:58:53 +02:00
if ( ! btrfs_supported_super_csum ( btrfs_super_csum_type ( sb ) ) ) {
btrfs: Do super block verification before writing it to disk
There are already 2 reports about strangely corrupted super blocks,
where csum still matches but extra garbage gets slipped into super block.
The corruption would looks like:
------
superblock: bytenr=65536, device=/dev/sdc1
---------------------------------------------------------
csum_type 41700 (INVALID)
csum 0x3b252d3a [match]
bytenr 65536
flags 0x1
( WRITTEN )
magic _BHRfS_M [match]
...
incompat_flags 0x5b22400000000169
( MIXED_BACKREF |
COMPRESS_LZO |
BIG_METADATA |
EXTENDED_IREF |
SKINNY_METADATA |
unknown flag: 0x5b22400000000000 )
...
------
Or
------
superblock: bytenr=65536, device=/dev/mapper/x
---------------------------------------------------------
csum_type 35355 (INVALID)
csum_size 32
csum 0xf0dbeddd [match]
bytenr 65536
flags 0x1
( WRITTEN )
magic _BHRfS_M [match]
...
incompat_flags 0x176d200000000169
( MIXED_BACKREF |
COMPRESS_LZO |
BIG_METADATA |
EXTENDED_IREF |
SKINNY_METADATA |
unknown flag: 0x176d200000000000 )
------
Obviously, csum_type and incompat_flags contain garbage, but the csum
still matches, which means the kernel calculated the csum from the
corrupted super block memory.
And after manually fixing these values, the filesystem is completely
healthy without any problem exposed by btrfs check.
Although the cause is still unknown, at least detect it and prevent further
corruption.
Both reports have the same symptoms: a 4-byte overwrite at offset 192 of
the superblock. The superblock structure is not allocated or
freed and stays in memory for the whole filesystem lifetime, so it's
not a use-after-free kind of error on someone else's leaked page.
A vague hint at the probable cause is that the reports mention other system
freezes related to graphics card drivers.
Reported-by: Ken Swenson <flat@imo.uto.moe>
Reported-by: Ben Parsons <9parsonsb@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add brief analysis of the reports ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-11 13:35:27 +08:00
ret = - EUCLEAN ;
btrfs_err ( fs_info , " invalid csum type, has %u want %u " ,
btrfs_super_csum_type ( sb ) , BTRFS_CSUM_TYPE_CRC32 ) ;
goto out ;
}
if ( btrfs_super_incompat_flags ( sb ) & ~ BTRFS_FEATURE_INCOMPAT_SUPP ) {
ret = - EUCLEAN ;
btrfs_err ( fs_info ,
" invalid incompat flags, has 0x%llx valid mask 0x%llx " ,
btrfs_super_incompat_flags ( sb ) ,
( unsigned long long ) BTRFS_FEATURE_INCOMPAT_SUPP ) ;
goto out ;
}
out :
if ( ret < 0 )
btrfs_err ( fs_info ,
" super block corruption detected before writing it to disk " ) ;
return ret ;
}
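The incompat-flags check in the function above can be tried in isolation. The following is a minimal sketch, not the kernel code: the `DEMO_*` names are hypothetical, the bit positions are the ones implied by the `0x169` low bits decoded in the superblock dumps of the commit message, and the supported mask is assumed to contain only those five flags.

```c
#include <stdint.h>

/* Bit positions decoded from the low 0x169 bits shown in the dumps. */
#define DEMO_MIXED_BACKREF	(1ULL << 0)
#define DEMO_COMPRESS_LZO	(1ULL << 3)
#define DEMO_BIG_METADATA	(1ULL << 5)
#define DEMO_EXTENDED_IREF	(1ULL << 6)
#define DEMO_SKINNY_METADATA	(1ULL << 8)

/* Assumption: a supported mask holding only the five flags above. */
#define DEMO_INCOMPAT_SUPP \
	(DEMO_MIXED_BACKREF | DEMO_COMPRESS_LZO | DEMO_BIG_METADATA | \
	 DEMO_EXTENDED_IREF | DEMO_SKINNY_METADATA)

/* Return the bits of flags that fall outside the supported mask. */
uint64_t demo_unknown_incompat(uint64_t flags)
{
	return flags & ~DEMO_INCOMPAT_SUPP;
}

/* Return 0 if flags are acceptable, -1 if any unknown bit is set. */
int demo_validate_incompat(uint64_t flags)
{
	return demo_unknown_incompat(flags) ? -1 : 0;
}
```

Feeding it the first reported value, `0x5b22400000000169`, leaves `0x5b22400000000000` as the unknown part, which is the same residue the dump prints as "unknown flag".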
2021-12-15 15:40:06 -05:00
static int load_super_root ( struct btrfs_root * root , u64 bytenr , u64 gen , int level )
{
2022-09-14 13:32:50 +08:00
struct btrfs_tree_parent_check check = {
. level = level ,
. transid = gen ,
2024-04-15 16:16:23 -04:00
. owner_root = btrfs_root_id ( root )
2022-09-14 13:32:50 +08:00
} ;
2021-12-15 15:40:06 -05:00
int ret = 0 ;
2022-09-14 13:32:50 +08:00
root - > node = read_tree_block ( root - > fs_info , bytenr , & check ) ;
2021-12-15 15:40:06 -05:00
if ( IS_ERR ( root - > node ) ) {
ret = PTR_ERR ( root - > node ) ;
root - > node = NULL ;
2022-02-22 15:41:19 +08:00
return ret ;
}
if ( ! extent_buffer_uptodate ( root - > node ) ) {
2021-12-15 15:40:06 -05:00
free_extent_buffer ( root - > node ) ;
root - > node = NULL ;
2022-02-22 15:41:19 +08:00
return - EIO ;
2021-12-15 15:40:06 -05:00
}
btrfs_set_root_node ( & root - > root_item , root - > node ) ;
root - > commit_root = btrfs_root_node ( root ) ;
btrfs_set_root_refs ( & root - > root_item , 1 ) ;
return ret ;
}
static int load_important_roots ( struct btrfs_fs_info * fs_info )
{
struct btrfs_super_block * sb = fs_info - > super_copy ;
u64 gen , bytenr ;
int level , ret ;
bytenr = btrfs_super_root ( sb ) ;
gen = btrfs_super_generation ( sb ) ;
level = btrfs_super_root_level ( sb ) ;
ret = load_super_root ( fs_info - > tree_root , bytenr , gen , level ) ;
2021-12-15 15:40:07 -05:00
if ( ret ) {
2021-12-15 15:40:06 -05:00
btrfs_warn ( fs_info , " couldn't read tree root " ) ;
2021-12-15 15:40:07 -05:00
return ret ;
}
2022-08-09 13:02:17 +08:00
return 0 ;
2021-12-15 15:40:06 -05:00
}
2019-10-15 18:42:24 +03:00
static int __cold init_tree_roots ( struct btrfs_fs_info * fs_info )
2019-10-15 18:42:20 +03:00
{
2019-10-15 18:42:24 +03:00
int backup_index = find_newest_super_backup ( fs_info ) ;
2019-10-15 18:42:20 +03:00
struct btrfs_super_block * sb = fs_info - > super_copy ;
struct btrfs_root * tree_root = fs_info - > tree_root ;
bool handle_error = false ;
int ret = 0 ;
int i ;
for ( i = 0 ; i < BTRFS_NUM_BACKUP_ROOTS ; i + + ) {
if ( handle_error ) {
if ( ! IS_ERR ( tree_root - > node ) )
free_extent_buffer ( tree_root - > node ) ;
tree_root - > node = NULL ;
if ( ! btrfs_test_opt ( fs_info , USEBACKUPROOT ) )
break ;
free_root_pointers ( fs_info , 0 ) ;
/*
* Don ' t use the log in recovery mode , it won ' t be
* valid
*/
btrfs_set_super_log_root ( sb , 0 ) ;
btrfs: do not ASSERT() on duplicated global roots
[BUG]
Syzbot reports a reproducible ASSERT() when using rescue=usebackuproot
mount option on a corrupted fs.
The full report can be found here:
https://syzkaller.appspot.com/bug?extid=c4614eae20a166c25bf0
BTRFS error (device loop0: state C): failed to load root csum
assertion failed: !tmp, in fs/btrfs/disk-io.c:1103
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3664!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 3608 Comm: syz-executor356 Not tainted 6.0.0-rc7-syzkaller-00029-g3800a713b607 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
RIP: 0010:assertfail+0x1a/0x1c fs/btrfs/ctree.h:3663
RSP: 0018:ffffc90003aaf250 EFLAGS: 00010246
RAX: 0000000000000032 RBX: 0000000000000000 RCX: f21c13f886638400
RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
RBP: ffff888021c640a0 R08: ffffffff816bd38d R09: ffffed10173667f1
R10: ffffed10173667f1 R11: 1ffff110173667f0 R12: dffffc0000000000
R13: ffff8880229c21f7 R14: ffff888021c64060 R15: ffff8880226c0000
FS: 0000555556a73300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055a2637d7a00 CR3: 00000000709c4000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
btrfs_global_root_insert+0x1a7/0x1b0 fs/btrfs/disk-io.c:1103
load_global_roots_objectid+0x482/0x8c0 fs/btrfs/disk-io.c:2467
load_global_roots fs/btrfs/disk-io.c:2501 [inline]
btrfs_read_roots fs/btrfs/disk-io.c:2528 [inline]
init_tree_roots+0xccb/0x203c fs/btrfs/disk-io.c:2939
open_ctree+0x1e53/0x33df fs/btrfs/disk-io.c:3574
btrfs_fill_super+0x1c6/0x2d0 fs/btrfs/super.c:1456
btrfs_mount_root+0x885/0x9a0 fs/btrfs/super.c:1824
legacy_get_tree+0xea/0x180 fs/fs_context.c:610
vfs_get_tree+0x88/0x270 fs/super.c:1530
fc_mount fs/namespace.c:1043 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1073
btrfs_mount+0x3d3/0xbb0 fs/btrfs/super.c:1884
[CAUSE]
Since the introduction of global roots, we handle
csum/extent/free-space-tree roots as global roots, even if no
extent-tree-v2 feature is enabled.
So for regular csum/extent/fst roots, we load them into
fs_info::global_root_tree rb tree.
And we should not expect any conflicts in that rb tree, thus we have an
ASSERT() inside btrfs_global_root_insert().
But rescue=usebackuproot can break the assumption, as we will try to
load those trees again and again as long as we have bad roots and have
backup root slots remaining.
So in that case we can have conflicting roots in the rb tree,
triggering the ASSERT() crash.
[FIX]
We can safely remove that ASSERT(), as the caller will properly put the
offending root.
To make further debugging easier, also add two explicit error messages:
- Error message for conflicting global roots
- Error message when using backup roots slot
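The idea of reporting a conflict instead of asserting can be sketched as follows. This is an illustration only: it uses a plain unbalanced binary search tree rather than the kernel rb tree, and `struct demo_root` and `demo_root_insert()` are hypothetical names.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a global root, keyed by a single id. */
struct demo_root {
	uint64_t key;
	struct demo_root *left;
	struct demo_root *right;
};

/*
 * Insert node into the tree rooted at *tree.
 * Returns 0 on success, or -EEXIST when the key is already present.
 * On -EEXIST the tree is left untouched, so the caller can log the
 * conflict and drop its reference to the duplicate.
 */
int demo_root_insert(struct demo_root **tree, struct demo_root *node)
{
	struct demo_root **link = tree;

	while (*link) {
		if (node->key < (*link)->key)
			link = &(*link)->left;
		else if (node->key > (*link)->key)
			link = &(*link)->right;
		else
			return -EEXIST;
	}
	node->left = NULL;
	node->right = NULL;
	*link = node;
	return 0;
}
```

Returning an error keeps the retry loop over backup slots alive; a hard assertion would stop the mount attempt at the first repeated key.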
Reported-by: syzbot+a694851c6ab28cbcfb9c@syzkaller.appspotmail.com
Fixes: abed4aaae4f7 ("btrfs: track the csum, extent, and free space trees in a rb tree")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-11 08:09:13 +08:00
btrfs_warn ( fs_info , " try to load backup roots slot %d " , i ) ;
2019-10-15 18:42:20 +03:00
ret = read_backup_root ( fs_info , i ) ;
2019-10-15 18:42:24 +03:00
backup_index = ret ;
2019-10-15 18:42:20 +03:00
if ( ret < 0 )
return ret ;
}
2021-12-15 15:40:06 -05:00
ret = load_important_roots ( fs_info ) ;
if ( ret ) {
2020-08-12 16:16:35 +03:00
handle_error = true ;
2019-10-15 18:42:20 +03:00
continue ;
}
2019-10-15 18:42:21 +03:00
/*
* No need to hold btrfs_root : : objectid_mutex since the fs
* hasn ' t been fully initialised and we are the only user
*/
2020-12-07 17:32:32 +02:00
ret = btrfs_init_root_free_objectid ( tree_root ) ;
2019-10-15 18:42:20 +03:00
if ( ret < 0 ) {
handle_error = true ;
continue ;
}
2020-12-07 17:32:35 +02:00
ASSERT ( tree_root - > free_objectid < = BTRFS_LAST_FREE_OBJECTID ) ;
2019-10-15 18:42:20 +03:00
ret = btrfs_read_roots ( fs_info ) ;
if ( ret < 0 ) {
handle_error = true ;
continue ;
}
/* All successful */
2021-12-15 15:40:06 -05:00
fs_info - > generation = btrfs_header_generation ( tree_root - > node ) ;
2023-10-04 11:38:51 +01:00
btrfs_set_last_trans_committed ( fs_info , fs_info - > generation ) ;
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (especially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystems we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove or update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have fewer path
releases and less re-searching of the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
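Step 4 of the list above (restart the search when a relocation transaction has committed) can be modelled with a small single-threaded simulation. This is a sketch under stated assumptions, not the send implementation: all `demo_*` names are hypothetical, keys are plain integers, and `reloc_at` is an artificial hook that fakes one relocation commit.

```c
#include <stdint.h>

/* Hypothetical: generation of the last committed relocation transaction. */
struct demo_fs {
	uint64_t last_reloc_trans;
};

/* Hypothetical per-send state. */
struct demo_send {
	uint64_t seen_reloc_trans;	/* value sampled at the last (re)search */
	int restarts;			/* how many times the search was redone */
};

/* Return 1 if cached search paths may be stale and must be re-searched. */
int demo_need_restart(const struct demo_fs *fs, struct demo_send *sctx)
{
	if (fs->last_reloc_trans > sctx->seen_reloc_trans) {
		sctx->seen_reloc_trans = fs->last_reloc_trans;
		sctx->restarts++;
		return 1;
	}
	return 0;
}

/*
 * Visit keys 0..nr_keys-1 in order. A relocation commit is simulated
 * once, just before key reloc_at is processed (pass -1 for none).
 * Returns the number of keys processed.
 */
int demo_send_walk(struct demo_fs *fs, struct demo_send *sctx,
		   int nr_keys, int reloc_at)
{
	int processed = 0;
	int key = 0;

	while (key < nr_keys) {
		if (key == reloc_at) {
			fs->last_reloc_trans++;
			reloc_at = -1;	/* fire only once */
		}
		if (demo_need_restart(fs, sctx))
			continue;	/* re-search at the same key */
		processed++;
		key++;
	}
	return processed;
}
```

Because the trees are read-only, resuming at the same key neither skips nor repeats a key; the only cost is the extra search.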
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 12:03:38 +00:00
fs_info - > last_reloc_trans = 0 ;
2019-10-15 18:42:24 +03:00
/* Always begin writing backup roots after the one being used */
if ( backup_index < 0 ) {
fs_info - > backup_root_index = 0 ;
} else {
fs_info - > backup_root_index = backup_index + 1 ;
fs_info - > backup_root_index % = BTRFS_NUM_BACKUP_ROOTS ;
}
2019-10-15 18:42:20 +03:00
break ;
}
return ret ;
}
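The backup slot rotation at the end of init_tree_roots() can be looked at on its own. A minimal sketch, assuming four backup slots; `DEMO_NUM_BACKUP_ROOTS` and `demo_next_backup_index()` are hypothetical names chosen for illustration.

```c
#include <assert.h>

#define DEMO_NUM_BACKUP_ROOTS 4	/* assumed slot count for this sketch */

/*
 * Given the slot that was used to mount (or a negative value when no
 * backup slot was used), return the slot that should be written next.
 */
int demo_next_backup_index(int used_index)
{
	assert(used_index < DEMO_NUM_BACKUP_ROOTS);

	if (used_index < 0)
		return 0;
	return (used_index + 1) % DEMO_NUM_BACKUP_ROOTS;
}
```

Writing to the slot after the one in use means the root that just proved readable is the last one to be overwritten.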
2020-01-24 09:32:59 -05:00
void btrfs_init_fs_info ( struct btrfs_fs_info * fs_info )
2007-03-21 11:12:56 -04:00
{
2022-07-15 13:59:21 +02:00
INIT_RADIX_TREE ( & fs_info - > fs_roots_radix , GFP_ATOMIC ) ;
2022-07-15 13:59:31 +02:00
INIT_RADIX_TREE ( & fs_info - > buffer_radix , GFP_ATOMIC ) ;
2007-04-19 21:01:03 -04:00
INIT_LIST_HEAD ( & fs_info - > trans_list ) ;
2007-06-08 18:11:48 -04:00
INIT_LIST_HEAD ( & fs_info - > dead_roots ) ;
2009-11-12 09:36:34 +00:00
INIT_LIST_HEAD ( & fs_info - > delayed_iputs ) ;
2013-05-15 07:48:22 +00:00
INIT_LIST_HEAD ( & fs_info - > delalloc_roots ) ;
2009-09-11 16:11:19 -04:00
INIT_LIST_HEAD ( & fs_info - > caching_block_groups ) ;
2013-05-15 07:48:22 +00:00
spin_lock_init ( & fs_info - > delalloc_root_lock ) ;
2011-04-11 17:25:13 -04:00
spin_lock_init ( & fs_info - > trans_lock ) ;
2022-07-15 13:59:21 +02:00
spin_lock_init ( & fs_info - > fs_roots_radix_lock ) ;
2009-11-12 09:36:34 +00:00
spin_lock_init ( & fs_info - > delayed_iput_lock ) ;
2011-05-24 15:35:30 -04:00
spin_lock_init ( & fs_info - > defrag_inodes_lock ) ;
2013-04-11 10:30:16 +00:00
spin_lock_init ( & fs_info - > super_lock ) ;
2013-12-16 13:24:27 -05:00
spin_lock_init ( & fs_info - > buffer_lock ) ;
2014-09-18 11:20:02 -04:00
spin_lock_init ( & fs_info - > unused_bgs_lock ) ;
2021-02-04 19:22:18 +09:00
spin_lock_init ( & fs_info - > treelog_bg_lock ) ;
2021-08-19 21:19:17 +09:00
spin_lock_init ( & fs_info - > zone_active_bgs_lock ) ;
2021-09-09 01:19:26 +09:00
spin_lock_init ( & fs_info - > relocation_bg_lock ) ;
2012-05-16 17:55:38 +02:00
rwlock_init ( & fs_info - > tree_mod_log_lock ) ;
2021-11-05 16:45:51 -04:00
rwlock_init ( & fs_info - > global_root_lock ) ;
2015-02-26 10:49:20 +08:00
mutex_init ( & fs_info - > unused_bg_unpin_mutex ) ;
2021-04-19 16:41:01 +09:00
mutex_init ( & fs_info - > reclaim_bgs_lock ) ;
2011-06-13 20:00:16 -04:00
mutex_init ( & fs_info - > reloc_mutex ) ;
2014-03-06 13:55:03 +08:00
mutex_init ( & fs_info - > delalloc_root_mutex ) ;
2021-02-04 19:22:08 +09:00
mutex_init ( & fs_info - > zoned_meta_io_lock ) ;
2022-04-18 16:15:03 +09:00
mutex_init ( & fs_info - > zoned_data_reloc_io_lock ) ;
2013-01-29 10:13:12 +00:00
seqlock_init ( & fs_info - > profiles_lock ) ;
2007-10-15 16:19:22 -04:00
2022-07-25 15:11:48 -07:00
btrfs_lockdep_init_map ( fs_info , btrfs_trans_num_writers ) ;
2022-07-25 15:11:50 -07:00
btrfs_lockdep_init_map ( fs_info , btrfs_trans_num_extwriters ) ;
2022-07-25 15:11:54 -07:00
btrfs_lockdep_init_map ( fs_info , btrfs_trans_pending_ordered ) ;
2022-07-25 15:11:59 -07:00
btrfs_lockdep_init_map ( fs_info , btrfs_ordered_extent ) ;
2023-08-24 16:59:22 -04:00
btrfs_state_lockdep_init_map ( fs_info , btrfs_trans_commit_prep ,
BTRFS_LOCKDEP_TRANS_COMMIT_PREP ) ;
2022-07-25 15:11:52 -07:00
btrfs_state_lockdep_init_map ( fs_info , btrfs_trans_unblocked ,
BTRFS_LOCKDEP_TRANS_UNBLOCKED ) ;
btrfs_state_lockdep_init_map ( fs_info , btrfs_trans_super_committed ,
BTRFS_LOCKDEP_TRANS_SUPER_COMMITTED ) ;
btrfs_state_lockdep_init_map ( fs_info , btrfs_trans_completed ,
BTRFS_LOCKDEP_TRANS_COMPLETED ) ;
2022-07-25 15:11:48 -07:00
2008-03-24 15:01:56 -04:00
INIT_LIST_HEAD ( & fs_info - > dirty_cowonly_roots ) ;
2008-03-24 15:01:59 -04:00
INIT_LIST_HEAD ( & fs_info - > space_info ) ;
2012-05-16 17:55:38 +02:00
INIT_LIST_HEAD ( & fs_info - > tree_mod_seq_list ) ;
2014-09-18 11:20:02 -04:00
INIT_LIST_HEAD ( & fs_info - > unused_bgs ) ;
2021-04-19 16:41:02 +09:00
INIT_LIST_HEAD ( & fs_info - > reclaim_bgs ) ;
2021-08-19 21:19:17 +09:00
INIT_LIST_HEAD ( & fs_info - > zone_active_bgs ) ;
2020-01-24 09:33:00 -05:00
# ifdef CONFIG_BTRFS_DEBUG
INIT_LIST_HEAD ( & fs_info - > allocated_roots ) ;
2020-02-14 16:11:40 -05:00
INIT_LIST_HEAD ( & fs_info - > allocated_ebs ) ;
spin_lock_init ( & fs_info - > eb_leak_lock ) ;
2020-01-24 09:33:00 -05:00
# endif
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extent maps per 4K
page instead of 28.
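The shape of such a dedicated structure, and the page-packing arithmetic quoted above, can be sketched as follows. The struct is a simplified, hypothetical stand-in (`struct demo_chunk_map`), not the kernel definition: it keeps only the scalar fields named in the list and omits the rb tree node and the stripe array.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in with the scalar fields named in the list above. */
struct demo_chunk_map {
	uint64_t start;		/* logical start address of the chunk */
	uint64_t chunk_len;	/* length of the chunk */
	uint64_t stripe_size;	/* stripe size */
	int refs;		/* reference count */
};

/* Return 1 if logical lies inside [start, start + chunk_len). */
int demo_chunk_map_contains(const struct demo_chunk_map *map, uint64_t logical)
{
	return logical >= map->start && logical - map->start < map->chunk_len;
}

/* How many objects of obj_size bytes fit in a page of page_size bytes. */
size_t demo_objects_per_page(size_t page_size, size_t obj_size)
{
	return page_size / obj_size;
}
```

With a 4096-byte page, 144-byte objects give 28 per page and 136-byte objects give 30, matching the figures in the message.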
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
fs_info - > mapping_tree = RB_ROOT_CACHED ;
rwlock_init ( & fs_info - > mapping_tree_lock ) ;
2012-09-06 04:02:28 -06:00
btrfs_init_block_rsv ( & fs_info - > global_block_rsv ,
BTRFS_BLOCK_RSV_GLOBAL ) ;
btrfs_init_block_rsv ( & fs_info - > trans_block_rsv , BTRFS_BLOCK_RSV_TRANS ) ;
btrfs_init_block_rsv ( & fs_info - > chunk_block_rsv , BTRFS_BLOCK_RSV_CHUNK ) ;
btrfs_init_block_rsv ( & fs_info - > empty_block_rsv , BTRFS_BLOCK_RSV_EMPTY ) ;
btrfs_init_block_rsv ( & fs_info - > delayed_block_rsv ,
BTRFS_BLOCK_RSV_DELOPS ) ;
btrfs: introduce delayed_refs_rsv
Traditionally we've had voodoo in btrfs to account for the space that
delayed refs may take up by having a global_block_rsv. This works most
of the time, except when it doesn't. We've had issues reported and seen
in production where sometimes the global reserve is exhausted during
transaction commit before we can run all of our delayed refs, resulting
in an aborted transaction. Because of this voodoo we have equally
dubious flushing semantics around throttling delayed refs which we often
get wrong.
So instead give them their own block_rsv. This way we can always know
exactly how much outstanding space we need for delayed refs. This
allows us to make sure we are constantly filling that reservation up
with space, and allows us to put more precise pressure on the enospc
system. Instead of doing math to see if it's a good time to throttle,
the normal enospc code will be invoked if we have a lot of delayed refs
pending, and they will be run via the normal flushing mechanism.
For now the delayed_refs_rsv will hold the reservations for the delayed
refs, the block group updates, and deleting csums. We could have a
separate rsv for the block group updates, but the csum deletion stuff is
still handled via the delayed_refs so that will stay there.
Historical background:
The global reserve has grown to cover everything we don't reserve space
explicitly for, and we've grown a lot of weird ad-hoc heuristics to know
if we're running short on space and when it's time to force a commit. A
failure rate of 20-40 file systems when we run hundreds of thousands of
them isn't super high, but cleaning up this code will make things less
ugly and more predictable.
Thus the delayed refs rsv. We always know how many delayed refs we have
outstanding, and although running them generates more we can use the
global reserve for that spill over, which fits better into its desired
use than a full blown reservation. This first approach is to simply
take how many times we're reserving space for and multiply that by 2 in
order to save enough space for the delayed refs that could be generated.
This is a naive approach and will probably evolve, but for now it works.
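The "multiply by 2" sizing rule can be written down as a tiny model. This is a sketch only: `struct demo_block_rsv`, its two fields, and the per-item byte cost are assumptions made for illustration, not the kernel's accounting code.

```c
#include <stdint.h>

/* Hypothetical reservation: how much is wanted vs. how much is held. */
struct demo_block_rsv {
	uint64_t size;		/* bytes the reservation should cover */
	uint64_t reserved;	/* bytes actually set aside so far */
};

/*
 * Account for nr_items new delayed ref updates. Each one is charged
 * twice its cost to leave room for refs generated while running them.
 */
void demo_rsv_add_items(struct demo_block_rsv *rsv, uint64_t nr_items,
			uint64_t bytes_per_item)
{
	rsv->size += nr_items * bytes_per_item * 2;
}

/* Bytes still missing before the reservation is fully backed. */
uint64_t demo_rsv_shortfall(const struct demo_block_rsv *rsv)
{
	return rsv->size > rsv->reserved ? rsv->size - rsv->reserved : 0;
}
```

A non-zero shortfall is the signal that would make the normal enospc flushing path refill the reservation, instead of ad-hoc throttling math.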
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com> # high-level review
[ added background notes from the cover letter ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-03 10:20:33 -05:00
btrfs_init_block_rsv ( & fs_info - > delayed_refs_rsv ,
BTRFS_BLOCK_RSV_DELREFS ) ;
2008-11-06 22:02:51 -05:00
atomic_set ( & fs_info - > async_delalloc_pages , 0 ) ;
2011-05-24 15:35:30 -04:00
atomic_set ( & fs_info - > defrag_running , 0 ) ;
2018-12-03 11:06:52 -05:00
atomic_set ( & fs_info - > nr_delayed_iputs , 0 ) ;
2013-04-24 16:57:33 +00:00
atomic64_set ( & fs_info - > tree_mod_seq , 0 ) ;
2021-11-05 16:45:51 -04:00
fs_info - > global_root_tree = RB_ROOT ;
2013-08-08 22:45:48 +01:00
fs_info - > max_inline = BTRFS_DEFAULT_MAX_INLINE ;
2009-09-11 16:12:44 -04:00
fs_info - > metadata_ratio = 0 ;
2011-05-24 15:35:30 -04:00
fs_info - > defrag_inodes = RB_ROOT ;
2017-05-11 09:17:46 +03:00
atomic64_set ( & fs_info - > free_chunk_space , 0 ) ;
2012-05-16 17:55:38 +02:00
fs_info - > tree_mod_log = RB_ROOT ;
2013-08-01 18:14:52 +02:00
fs_info - > commit_interval = BTRFS_DEFAULT_COMMIT_INTERVAL ;
2017-09-29 15:43:50 -04:00
btrfs_init_ref_verify ( fs_info ) ;
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 14:49:59 -04:00
2008-12-19 15:43:22 -05:00
fs_info - > thread_pool_size = min_t ( unsigned long ,
num_online_cpus ( ) + 2 , 8 ) ;
2008-04-18 14:17:20 -04:00
2013-05-15 07:48:23 +00:00
INIT_LIST_HEAD ( & fs_info - > ordered_roots ) ;
spin_lock_init ( & fs_info - > ordered_root_lock ) ;
2017-10-19 14:15:57 -04:00
2014-08-01 18:12:38 -05:00
btrfs_init_scrub ( fs_info ) ;
2014-08-01 18:12:39 -05:00
btrfs_init_balance ( fs_info ) ;
2020-07-21 10:22:33 -04:00
btrfs_init_async_reclaim_work ( fs_info ) ;
2011-03-08 14:14:00 +01:00
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
rwlock_init ( & fs_info - > block_group_cache_lock ) ;
2022-04-13 16:20:40 +01:00
fs_info - > block_group_cache_tree = RB_ROOT_CACHED ;
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree that tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. There were some issues in volumes.c where we wouldn't allocate
the rest of the disk. Fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so every time we did a data allocation we'd search
through every block group that we have looking for free space. Searching a
block group with no free space isn't terribly time consuming, but it was causing
a slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degradation and made things semi-normal.
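The lookup order described in item 1 can be sketched roughly as follows. This
is only an illustration, not the code from this patch: the struct and function
names are hypothetical, a linear scan over an array stands in for the two
rb-trees, and the tie-break in the size fallback (tightest fit first, then
closest to the hint) is an assumption about how "as close as possible" is
weighed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one free-space area inside a block group. */
struct free_area {
	uint64_t offset;	/* start byte of the free area */
	uint64_t bytes;		/* length of the free area */
};

static uint64_t distance(uint64_t a, uint64_t b)
{
	return a > b ? a - b : b - a;
}

/* Offset pass: lowest area at or after the hint that is large enough. */
static const struct free_area *find_by_offset(const struct free_area *areas,
					      size_t n, uint64_t hint,
					      uint64_t bytes)
{
	const struct free_area *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (areas[i].offset < hint || areas[i].bytes < bytes)
			continue;
		if (!best || areas[i].offset < best->offset)
			best = &areas[i];
	}
	return best;
}

/*
 * Size pass (fallback): tightest area that still fits, so small requests do
 * not carve pieces off huge areas. Ties go to the area closest to the hint
 * (assumed ordering).
 */
static const struct free_area *find_by_size(const struct free_area *areas,
					    size_t n, uint64_t hint,
					    uint64_t bytes)
{
	const struct free_area *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (areas[i].bytes < bytes)
			continue;
		if (!best || areas[i].bytes < best->bytes ||
		    (areas[i].bytes == best->bytes &&
		     distance(areas[i].offset, hint) <
		     distance(best->offset, hint)))
			best = &areas[i];
	}
	return best;
}

/* Try the hint first, then fall back to the size search. */
static const struct free_area *find_free_area(const struct free_area *areas,
					      size_t n, uint64_t hint,
					      uint64_t bytes)
{
	const struct free_area *area = find_by_offset(areas, n, hint, bytes);

	return area ? area : find_by_size(areas, n, hint, bytes);
}
```

A caller would pass the allocation hint and the wanted size, and treat a NULL
result as "no area in this block group fits".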
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
2020-01-20 16:09:18 +02:00
	extent_io_tree_init(fs_info, &fs_info->excluded_extents,
2022-10-28 02:47:06 +02:00
			    IO_TREE_FS_EXCLUDED_EXTENTS);
2007-06-12 06:35:45 -04:00
2009-03-31 13:27:11 -04:00
	mutex_init(&fs_info->ordered_operations_mutex);
2008-09-05 16:13:11 -04:00
	mutex_init(&fs_info->tree_log_mutex);
2008-06-25 16:01:30 -04:00
	mutex_init(&fs_info->chunk_mutex);
2008-06-25 16:01:31 -04:00
	mutex_init(&fs_info->transaction_kthread_mutex);
	mutex_init(&fs_info->cleaner_mutex);
2015-04-06 12:46:08 -07:00
	mutex_init(&fs_info->ro_block_group_mutex);
2014-03-13 15:42:13 -04:00
	init_rwsem(&fs_info->commit_root_sem);
2009-11-12 09:34:40 +00:00
	init_rwsem(&fs_info->cleanup_work_sem);
2009-09-21 16:00:26 -04:00
	init_rwsem(&fs_info->subvol_sem);
2013-08-15 17:11:21 +02:00
	sema_init(&fs_info->uuid_tree_rescan_sem, 1);
2009-04-03 09:47:43 -04:00
2014-08-01 18:12:41 -05:00
	btrfs_init_dev_replace_locks(fs_info);
2014-08-01 18:12:42 -05:00
	btrfs_init_qgroup(fs_info);
2019-12-13 16:22:14 -08:00
	btrfs_discard_init(fs_info);
2011-09-13 12:56:09 +02:00
2009-04-03 09:47:43 -04:00
	btrfs_init_free_cluster(&fs_info->meta_alloc_cluster);
	btrfs_init_free_cluster(&fs_info->data_alloc_cluster);
2008-07-17 12:53:50 -04:00
	init_waitqueue_head(&fs_info->transaction_throttle);
2008-07-17 12:54:14 -04:00
	init_waitqueue_head(&fs_info->transaction_wait);
2010-10-29 15:37:34 -04:00
	init_waitqueue_head(&fs_info->transaction_blocked_wait);
2008-08-15 15:34:17 -04:00
	init_waitqueue_head(&fs_info->async_submit_wait);
2018-12-03 11:06:52 -05:00
	init_waitqueue_head(&fs_info->delayed_iputs_wait);
2007-03-13 16:47:54 -04:00
2016-06-15 09:22:56 -04:00
	/* Usable values until the real ones are cached from the superblock */
	fs_info->nodesize = 4096;
	fs_info->sectorsize = 4096;
2020-07-01 20:45:04 +02:00
	fs_info->sectorsize_bits = ilog2(4096);
2016-06-15 09:22:56 -04:00
	fs_info->stripesize = 4096;
2023-11-22 12:17:39 -05:00
	/* Default compress algorithm when user does -o compress */
	fs_info->compress_type = BTRFS_COMPRESS_ZLIB;
btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size
On a zoned filesystem, data writeout is limited by max_zone_append_size,
and a large ordered extent is split according to the size of a bio. OTOH,
the number of extents to be written is calculated using
BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the
metadata bytes to update and/or create the metadata items.
The metadata reservation is done at, e.g., btrfs_buffered_write() and then
released according to the estimation changes. Thus, if the number of extents
increases massively, the reserved metadata can run out.
The number of extents easily increases on a zoned filesystem
if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size, and this causes the
following warning in a small-RAM environment with metadata over-commit
disabled (in the following patch).
[75721.498492] ------------[ cut here ]------------
[75721.505624] BTRFS: block rsv 1 returned -28
[75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G W 5.18.0-rc2-BTRFS-ZNS+ #109
[75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
[75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
[75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
[75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
[75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
[75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
[75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
[75721.701878] FS: 0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
[75721.712601] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
[75721.730499] Call Trace:
[75721.735166] <TASK>
[75721.739886] btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
[75721.747545] ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
[75721.756145] ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.762852] ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.769520] ? push_leaf_left+0x420/0x620 [btrfs]
[75721.776431] ? memcpy+0x4e/0x60
[75721.781931] split_leaf+0x433/0x12d0 [btrfs]
[75721.788392] ? btrfs_get_token_32+0x580/0x580 [btrfs]
[75721.795636] ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
[75721.803759] ? leaf_space_used+0x15d/0x1a0 [btrfs]
[75721.811156] btrfs_search_slot+0x1bc3/0x2790 [btrfs]
[75721.818300] ? lock_downgrade+0x7c0/0x7c0
[75721.824411] ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
[75721.832456] ? split_leaf+0x12d0/0x12d0 [btrfs]
[75721.839149] ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
[75721.846945] ? free_extent_buffer+0x13/0x20 [btrfs]
[75721.853960] ? btrfs_release_path+0x4b/0x190 [btrfs]
[75721.861429] btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
[75721.869313] ? rcu_read_lock_sched_held+0x16/0x80
[75721.876085] ? lock_release+0x552/0xf80
[75721.881957] ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
[75721.888886] ? __kasan_check_write+0x14/0x20
[75721.895152] ? do_raw_read_unlock+0x44/0x80
[75721.901323] ? _raw_write_lock_irq+0x60/0x80
[75721.907983] ? btrfs_global_root+0xb9/0xe0 [btrfs]
[75721.915166] ? btrfs_csum_root+0x12b/0x180 [btrfs]
[75721.921918] ? btrfs_get_global_root+0x820/0x820 [btrfs]
[75721.929166] ? _raw_write_unlock+0x23/0x40
[75721.935116] ? unpin_extent_cache+0x1e3/0x390 [btrfs]
[75721.942041] btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
[75721.949906] ? try_to_wake_up+0x30/0x14a0
[75721.955700] ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
[75721.962661] ? rcu_read_lock_sched_held+0x16/0x80
[75721.969111] ? lock_acquire+0x41b/0x4c0
[75721.974982] finish_ordered_fn+0x15/0x20 [btrfs]
[75721.981639] btrfs_work_helper+0x1af/0xa80 [btrfs]
[75721.988184] ? _raw_spin_unlock_irq+0x28/0x50
[75721.994643] process_one_work+0x815/0x1460
[75722.000444] ? pwq_dec_nr_in_flight+0x250/0x250
[75722.006643] ? do_raw_spin_trylock+0xbb/0x190
[75722.013086] worker_thread+0x59a/0xeb0
[75722.018511] kthread+0x2ac/0x360
[75722.023428] ? process_one_work+0x1460/0x1460
[75722.029431] ? kthread_complete_and_exit+0x30/0x30
[75722.036044] ret_from_fork+0x22/0x30
[75722.041255] </TASK>
[75722.045047] irq event stamp: 0
[75722.049703] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
[75722.067533] softirqs last enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
[75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
[75722.085335] ---[ end trace 0000000000000000 ]---
To fix the estimation, we need to introduce fs_info->max_extent_size to
replace BTRFS_MAX_EXTENT_SIZE, which allows setting a different size for
regular vs zoned filesystems.
Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On a zoned
filesystem, it is set to fs_info->max_zone_append_size.
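To make the under-estimation concrete, here is a small arithmetic sketch. It is
not the kernel helper itself: estimate_extents() is a hypothetical name, and
the 1 MiB zone append limit used in the usage note is an assumed example value
(real devices report their own). Only the 128 MiB default matches
BTRFS_MAX_EXTENT_SIZE.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1M			(1024ULL * 1024ULL)
/* Same value as BTRFS_MAX_EXTENT_SIZE (128 MiB). */
#define MAX_EXTENT_SIZE_DEFAULT	(128ULL * SZ_1M)

/*
 * Ceiling division: how many extents are needed to cover `bytes` when no
 * single extent may exceed `max_extent_size`. The result is what the
 * metadata reservation is sized from.
 */
static uint64_t estimate_extents(uint64_t bytes, uint64_t max_extent_size)
{
	assert(max_extent_size > 0);
	return (bytes + max_extent_size - 1) / max_extent_size;
}
```

For a 256 MiB range, estimate_extents(256 * SZ_1M, MAX_EXTENT_SIZE_DEFAULT)
gives 2, while the same range with an assumed 1 MiB append limit gives 256.
Reserving metadata for 2 extents and then writing 256 is the kind of gap that
lets the reservation run out.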
CC: stable@vger.kernel.org # 5.12+
Fixes: d8e3fb106f39 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-09 08:18:40 +09:00
	fs_info->max_extent_size = BTRFS_MAX_EXTENT_SIZE;
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
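The pin check described above could look roughly like the sketch below. All
names here are hypothetical, and a sorted array with bsearch() stands in for
the red-black tree; the only behaviour taken from the description is that a
pinned block group is skipped by balance and causes an error for resize.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pin record: identifies a block group or device in use by swap. */
struct swap_pin {
	uintptr_t key;		/* address-like identity of the pinned object */
	bool is_block_group;	/* false means the pin refers to a device */
};

enum reloc_reason { REASON_BALANCE, REASON_RESIZE };
enum reloc_action { RELOC_PROCEED, RELOC_SKIP, RELOC_ERROR };

static int pin_cmp(const void *a, const void *b)
{
	const struct swap_pin *x = a, *y = b;

	if (x->key < y->key)
		return -1;
	return x->key > y->key;
}

/* `pins` must be sorted by key; the kernel keeps an rb-tree instead. */
static bool is_pinned(const struct swap_pin *pins, size_t n, uintptr_t key)
{
	struct swap_pin probe = { .key = key };

	return bsearch(&probe, pins, n, sizeof(*pins), pin_cmp) != NULL;
}

/* Balance skips a pinned block group; resize has to fail instead. */
static enum reloc_action check_block_group(const struct swap_pin *pins,
					   size_t n, uintptr_t bg_key,
					   enum reloc_reason reason)
{
	if (!is_pinned(pins, n, bg_key))
		return RELOC_PROCEED;
	return reason == REASON_BALANCE ? RELOC_SKIP : RELOC_ERROR;
}
```

Device remove and device replace would do the same membership test with the
device as the key and refuse to run on a match.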
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-03 10:28:12 -07:00
	spin_lock_init(&fs_info->swapfile_pins_lock);
	fs_info->swapfile_pins = RB_ROOT;
2021-04-19 16:41:02 +09:00
	fs_info->bg_reclaim_threshold = BTRFS_DEFAULT_RECLAIM_THRESH;
	INIT_WORK(&fs_info->reclaim_bgs_work, btrfs_reclaim_bgs_work);
2020-01-24 09:32:59 -05:00
}
static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block *sb)
{
	int ret;

	fs_info->sb = sb;
2024-01-16 17:33:20 +01:00
	/* Temporary fixed values for block size until we read the superblock. */
2020-01-24 09:32:59 -05:00
	sb->s_blocksize = BTRFS_BDEV_BLOCKSIZE;
	sb->s_blocksize_bits = blksize_bits(BTRFS_BDEV_BLOCKSIZE);
Btrfs: prevent send failures and crashes due to concurrent relocation
Send always operates on read-only trees and always expected that while it
is in progress, nothing changes in those trees. Due to that expectation
and the fact that send is a read-only operation, it operates on commit
roots and does not hold transaction handles. However relocation can COW
nodes and leaves from read-only trees, which can cause unexpected failures
and crashes (hitting BUG_ONs). While send is using a node/leaf, it gets
COWed, the transaction used to COW it is committed, a new transaction
starts, the extent previously used for that node/leaf gets allocated,
possibly for another tree, and the respective extent buffer's content
changes while send is still using it. When this happens send normally
fails with EIO being returned to user space and messages like the
following are found in dmesg/syslog:
[ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253
[ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984
Other times, less often, we hit a BUG_ON() because an extent buffer that
send is using used to be a node, and while send was still using it, it
got COWed and reused as a leaf, producing the following trace:
[ 3478.466280] ------------[ cut here ]------------
[ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806!
[ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1
[ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246
[ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002
[ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000
[ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e
[ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708
[ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 3478.475840] FS: 00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000
[ 3478.476489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0
[ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3478.479003] Call Trace:
[ 3478.479600] ? do_raw_spin_unlock+0x49/0xc0
[ 3478.480202] tree_advance+0x173/0x1d0 [btrfs]
[ 3478.480810] btrfs_compare_trees+0x30c/0x690 [btrfs]
[ 3478.481388] ? process_extent+0x1280/0x1280 [btrfs]
[ 3478.481954] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[ 3478.482510] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[ 3478.483062] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[ 3478.483581] ? rq_clock_task+0x2e/0x60
[ 3478.484086] ? wake_up_new_task+0x1f3/0x370
[ 3478.484582] ? do_vfs_ioctl+0xa2/0x6f0
[ 3478.485075] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 3478.485552] do_vfs_ioctl+0xa2/0x6f0
[ 3478.486016] ? __fget+0x113/0x200
[ 3478.486467] ksys_ioctl+0x70/0x80
[ 3478.486911] __x64_sys_ioctl+0x16/0x20
[ 3478.487337] do_syscall_64+0x60/0x1b0
[ 3478.487751] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7
(...)
[ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7
[ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005
[ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700
[ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005
[ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001
(...)
[ 3478.493352] ---[ end trace d5f537302be4f8c8 ]---
Another possibility, much less likely to happen, is that send will not
fail but the contents of the stream it produces may not be correct.
To avoid this, do not allow send and relocation (balance) to run in
parallel. In the long term the goal is to allow for both to be able to
run concurrently without any problems, but that will take a significant
effort in development and testing.
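The exclusion rule can be sketched as follows. This is a userspace
illustration under stated assumptions, not the change made by this patch: the
struct, its field names and the four helpers are hypothetical, a pthread mutex
stands in for whatever lock the kernel uses, and failures are reported as a
plain bool rather than a specific errno.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-filesystem state guarding send against balance. */
struct fs_state {
	pthread_mutex_t lock;
	int sends_in_progress;	/* several sends may run at once */
	bool balance_running;	/* at most one balance/relocation */
};

/* A send may start only while no balance is running. */
static bool send_begin(struct fs_state *fs)
{
	bool ok;

	pthread_mutex_lock(&fs->lock);
	ok = !fs->balance_running;
	if (ok)
		fs->sends_in_progress++;
	pthread_mutex_unlock(&fs->lock);
	return ok;
}

static void send_end(struct fs_state *fs)
{
	pthread_mutex_lock(&fs->lock);
	assert(fs->sends_in_progress > 0);
	fs->sends_in_progress--;
	pthread_mutex_unlock(&fs->lock);
}

/* A balance may start only while no send and no other balance is running. */
static bool balance_begin(struct fs_state *fs)
{
	bool ok;

	pthread_mutex_lock(&fs->lock);
	ok = fs->sends_in_progress == 0 && !fs->balance_running;
	if (ok)
		fs->balance_running = true;
	pthread_mutex_unlock(&fs->lock);
	return ok;
}

static void balance_end(struct fs_state *fs)
{
	pthread_mutex_lock(&fs->lock);
	fs->balance_running = false;
	pthread_mutex_unlock(&fs->lock);
}
```

Both checks happen under the same lock as the state update, so there is no
window in which a send and a balance can each see the other as idle and both
start.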
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-22 16:44:09 +01:00
2020-10-09 09:28:20 -04:00
	ret = percpu_counter_init(&fs_info->ordered_bytes, 0, GFP_KERNEL);
2020-01-24 09:32:58 -05:00
	if (ret)
2020-02-14 16:11:47 -05:00
		return ret;
2020-01-24 09:32:58 -05:00
2024-03-22 18:02:59 +00:00
	ret = percpu_counter_init(&fs_info->evictable_extent_maps, 0, GFP_KERNEL);
	if (ret)
		return ret;
2024-07-08 15:42:45 +01:00
	spin_lock_init(&fs_info->extent_map_shrinker_lock);
2020-01-24 09:32:58 -05:00
	ret = percpu_counter_init(&fs_info->dirty_metadata_bytes, 0, GFP_KERNEL);
	if (ret)
2020-02-14 16:11:47 -05:00
		return ret;
2020-01-24 09:32:58 -05:00
	fs_info->dirty_metadata_batch = PAGE_SIZE *
					(1 + ilog2(nr_cpu_ids));
	ret = percpu_counter_init(&fs_info->delalloc_bytes, 0, GFP_KERNEL);
	if (ret)
2020-02-14 16:11:47 -05:00
		return ret;
2020-01-24 09:32:58 -05:00
	ret = percpu_counter_init(&fs_info->dev_replace.bio_counter, 0,
				  GFP_KERNEL);
	if (ret)
2020-02-14 16:11:47 -05:00
		return ret;
2020-01-24 09:32:58 -05:00
	fs_info->delayed_root = kmalloc(sizeof(struct btrfs_delayed_root),
					GFP_KERNEL);
2020-02-14 16:11:47 -05:00
	if (!fs_info->delayed_root)
		return -ENOMEM;
2020-01-24 09:32:58 -05:00
	btrfs_init_delayed_root(fs_info->delayed_root);
btrfs: fix race between RO remount and the cleaner task
When we are remounting a filesystem in RO mode we can race with the cleaner
task and result in leaking a transaction if the filesystem is unmounted
shortly after, before the transaction kthread had a chance to commit that
transaction. That also results in a crash during unmount, due to a
use-after-free, if hardware acceleration is not available for crc32c.
The following sequence of steps explains how the race happens.
1) The filesystem is mounted in RW mode and the cleaner task is running.
This means that currently BTRFS_FS_CLEANER_RUNNING is set at
fs_info->flags;
2) The cleaner task is currently running delayed iputs for example;
3) A filesystem RO remount operation starts;
4) The RO remount task calls btrfs_commit_super(), which commits any
currently open transaction, and it finishes;
5) At this point the cleaner task is still running and it creates a new
transaction by doing one of the following things:
* When running the delayed iput() for an inode with a 0 link count,
in which case at btrfs_evict_inode() we start a transaction through
the call to evict_refill_and_join(), use it and then release its
handle through btrfs_end_transaction();
* When deleting a dead root through btrfs_clean_one_deleted_snapshot(),
a transaction is started at btrfs_drop_snapshot() and then its handle
is released through a call to btrfs_end_transaction_throttle();
* When the remount task was still running, and before the remount task
called btrfs_delete_unused_bgs(), the cleaner task also called
btrfs_delete_unused_bgs() and it picked and removed one block group
from the list of unused block groups. Before the cleaner task started
a transaction, through btrfs_start_trans_remove_block_group() at
btrfs_delete_unused_bgs(), the remount task had already called
btrfs_commit_super();
6) So at this point the filesystem is in RO mode and we have an open
transaction that was started by the cleaner task;
7) Shortly after, a filesystem unmount operation starts. At close_ctree()
we stop the transaction kthread before it had a chance to commit the
transaction, since less than 30 seconds (the default commit interval)
have elapsed since the last transaction was committed;
8) We end up calling iput() against the btree inode at close_ctree() while
there is an open transaction, and since that transaction was used to
update btrees by the cleaner, we have dirty pages in the btree inode
due to COW operations on metadata extents, and therefore writeback is
triggered for the btree inode.
So btree_write_cache_pages() is invoked to flush those dirty pages
during the final iput() on the btree inode. This results in creating a
bio and submitting it, which makes us end up at
btrfs_submit_metadata_bio();
9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
that calls btrfs_wq_submit_bio(), because check_async_write() returned
a value of 1. This value of 1 is because we did not have hardware
acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
set in fs_info->flags;
10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
workqueue at fs_info->workers, which was already freed before by the
call to btrfs_stop_all_workers() at close_ctree(). This results in an
invalid memory access due to a use-after-free, leading to a crash.
When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:
------------[ cut here ]------------
WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
Code: f0 01 00 00 48 39 c2 75 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 48 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c6 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Code: 48 83 bb b0 03 00 00 00 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c7 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Code: ad de 49 be 22 01 00 (...)
RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c8 ]---
BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:
stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
Code: 54 55 53 48 89 f3 (...)
RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
submit_one_bio+0x61/0x70 [btrfs]
btree_write_cache_pages+0x414/0x450 [btrfs]
? kobject_put+0x9a/0x1d0
? trace_hardirqs_on+0x1b/0xf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
? free_debug_processing+0x1e1/0x2b0
do_writepages+0x43/0xe0
? lock_acquired+0x199/0x490
__writeback_single_inode+0x59/0x650
writeback_single_inode+0xaf/0x120
write_inode_now+0x94/0xd0
iput+0x187/0x2b0
close_ctree+0x2c6/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3cfebabee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
---[ end trace dd74718fef1ed5cc ]---
Finally, when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, a consequence of the transaction that was never committed and got
leaked:
=============================================================================
BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x0000000050cbdd61 @offset=12104
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x0000000086e9b0ff @offset=12776
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
commit_cowonly_roots+0x248/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000001a340018 @offset=4408
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_commit_transaction+0x60/0xc40 [btrfs]
create_subvol+0x56a/0x990 [btrfs]
btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
btrfs_ioctl+0x1a92/0x36f0 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x000000002b46292a @offset=13648
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? __mutex_unlock_slowpath+0x45/0x2a0
kmem_cache_destroy+0x55/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000004cf95ea8 @offset=6264
INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1
Fix this by making the remount path wait for the cleaner task before
calling btrfs_commit_super(). The remount path now waits for the bit
BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling
btrfs_commit_super(). This ensures the cleaner cannot start a
transaction after that, because it sleeps when the filesystem is in RO
mode and we have already flagged the filesystem as RO before waiting for
BTRFS_FS_CLEANER_RUNNING to be cleared.
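The ordering described above can be sketched in plain userspace C. This is
only an illustrative model, not the kernel code: struct demo_fs, the DEMO_*
bits and every demo_* function are hypothetical names, and the two tasks are
interleaved by hand in a single thread so the outcome is deterministic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for BTRFS_FS_CLEANER_RUNNING and BTRFS_FS_STATE_RO. */
enum { DEMO_CLEANER_RUNNING = 1u << 0, DEMO_STATE_RO = 1u << 1 };

struct demo_fs {
	atomic_uint flags;	/* holds both demo bits */
	int open_trans;		/* transactions started but not yet committed */
};

/* Cleaner: announce a pass, then report whether work is allowed (not RO). */
static bool demo_cleaner_begin(struct demo_fs *fs)
{
	atomic_fetch_or(&fs->flags, DEMO_CLEANER_RUNNING);
	return !(atomic_load(&fs->flags) & DEMO_STATE_RO);
}

/* Cleaner: the work itself starts a transaction. */
static void demo_cleaner_work(struct demo_fs *fs)
{
	fs->open_trans++;
}

/* Cleaner: the pass is over, clear the running bit. */
static void demo_cleaner_end(struct demo_fs *fs)
{
	atomic_fetch_and(&fs->flags, ~(unsigned)DEMO_CLEANER_RUNNING);
}

/* Remount: flag the filesystem as RO before anything else. */
static void demo_remount_mark_ro(struct demo_fs *fs)
{
	atomic_fetch_or(&fs->flags, DEMO_STATE_RO);
}

/* Remount: committing is safe only once the cleaner is not mid-pass. */
static bool demo_remount_may_commit(struct demo_fs *fs)
{
	return !(atomic_load(&fs->flags) & DEMO_CLEANER_RUNNING);
}

/* Commit whatever is open. */
static void demo_commit(struct demo_fs *fs)
{
	fs->open_trans = 0;
}

/* Returns the number of transactions still open after the remount. */
static int demo_scripted_race(bool wait_for_cleaner)
{
	struct demo_fs fs = { .open_trans = 0 };
	atomic_init(&fs.flags, 0);

	bool may_work = demo_cleaner_begin(&fs);	/* pass starts while RW */
	demo_remount_mark_ro(&fs);			/* remount starts */
	if (!wait_for_cleaner)
		demo_commit(&fs);	/* old order: commit while cleaner is mid-pass */
	if (may_work)
		demo_cleaner_work(&fs);	/* cleaner starts its transaction */
	demo_cleaner_end(&fs);
	if (wait_for_cleaner && demo_remount_may_commit(&fs))
		demo_commit(&fs);	/* new order: commit after the bit is clear */

	if (demo_cleaner_begin(&fs))	/* a later pass sees RO and does nothing */
		demo_cleaner_work(&fs);
	demo_cleaner_end(&fs);
	return fs.open_trans;
}
```

In this model, demo_scripted_race(false) leaves one transaction open while
demo_scripted_race(true) leaves none, because the RO bit is set before the
wait and no later cleaner pass can start new work.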
This also introduces a new flag BTRFS_FS_STATE_RO to be used for
fs_info->fs_state when the filesystem is in RO mode. This is because we
were doing the RO check using the flags of the superblock and setting the
RO mode simply by ORing into the superblock's flags - those operations are
not atomic and could result in the cleaner not seeing the update from the
remount task after it clears BTRFS_FS_CLEANER_RUNNING.
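As a generic illustration of why a plain OR into a shared flags word is
fragile, the sketch below spells out the load, OR and store steps of two
tasks by hand. It is an assumption-laden userspace model: the DEMO_* bits and
function names are hypothetical, and the source does not state that this
exact interleaving is the one that occurred. It only shows one hazard that an
atomic bit operation avoids.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits, for illustration only. */
enum { DEMO_SB_RDONLY = 1u << 0, DEMO_OTHER_FLAG = 1u << 1 };

/* A non-atomic "flags |= bit" is really three steps: load, OR, store. */
static unsigned demo_lost_update(void)
{
	unsigned flags = 0;
	unsigned seen_by_a = flags;		/* task A loads */
	unsigned seen_by_b = flags;		/* task B loads the same old value */

	flags = seen_by_a | DEMO_SB_RDONLY;	/* A stores the RO bit */
	flags = seen_by_b | DEMO_OTHER_FLAG;	/* B stores and overwrites A's bit */
	return flags;
}

/* An atomic read-modify-write cannot be split by another task. */
static unsigned demo_atomic_update(void)
{
	atomic_uint flags;

	atomic_init(&flags, 0);
	atomic_fetch_or(&flags, DEMO_SB_RDONLY);
	atomic_fetch_or(&flags, DEMO_OTHER_FLAG);
	return atomic_load(&flags);
}
```

The first function ends with only the second bit set, the second one keeps
both bits.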
Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-14 10:10:47 +00:00
if (sb_rdonly(sb))
	set_bit(BTRFS_FS_STATE_RO, &fs_info->fs_state);
2024-06-14 13:52:30 +09:30
if (btrfs_test_opt(fs_info, IGNOREMETACSUMS))
	set_bit(BTRFS_FS_STATE_SKIP_META_CSUMS, &fs_info->fs_state);
btrfs: fix race between RO remount and the cleaner task
When we are remounting a filesystem in RO mode we can race with the cleaner
task and result in leaking a transaction if the filesystem is unmounted
shortly after, before the transaction kthread had a chance to commit that
transaction. That also results in a crash during unmount, due to a
use-after-free, if hardware acceleration is not available for crc32c.
The following sequence of steps explains how the race happens.
1) The filesystem is mounted in RW mode and the cleaner task is running.
This means that currently BTRFS_FS_CLEANER_RUNNING is set at
fs_info->flags;
2) The cleaner task is currently running delayed iputs for example;
3) A filesystem RO remount operation starts;
4) The RO remount task calls btrfs_commit_super(), which commits any
currently open transaction, and it finishes;
5) At this point the cleaner task is still running and it creates a new
transaction by doing one of the following things:
* When running the delayed iput() for an inode with a 0 link count,
in which case at btrfs_evict_inode() we start a transaction through
the call to evict_refill_and_join(), use it and then release its
handle through btrfs_end_transaction();
* When deleting a dead root through btrfs_clean_one_deleted_snapshot(),
a transaction is started at btrfs_drop_snapshot() and then its handle
is released through a call to btrfs_end_transaction_throttle();
* When the remount task was still running, and before the remount task
called btrfs_delete_unused_bgs(), the cleaner task also called
btrfs_delete_unused_bgs() and it picked and removed one block group
from the list of unused block groups. Before the cleaner task started
a transaction, through btrfs_start_trans_remove_block_group() at
btrfs_delete_unused_bgs(), the remount task had already called
btrfs_commit_super();
6) So at this point the filesystem is in RO mode and we have an open
transaction that was started by the cleaner task;
7) Shortly after, a filesystem unmount operation starts. At close_ctree()
we stop the transaction kthread before it had a chance to commit the
transaction, since less than 30 seconds (the default commit interval)
have elapsed since the last transaction was committed;
8) We end up calling iput() against the btree inode at close_ctree() while
there is an open transaction, and since that transaction was used to
update btrees by the cleaner, we have dirty pages in the btree inode
due to COW operations on metadata extents, and therefore writeback is
triggered for the btree inode.
So btree_write_cache_pages() is invoked to flush those dirty pages
during the final iput() on the btree inode. This results in creating a
bio and submitting it, which makes us end up at
btrfs_submit_metadata_bio();
9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
that calls btrfs_wq_submit_bio(), because check_async_write() returned
a value of 1. This value of 1 is because we did not have hardware
acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
set in fs_info->flags;
10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
workqueue at fs_info->workers, which was already freed before by the
call to btrfs_stop_all_workers() at close_ctree(). This results in an
invalid memory access due to a use-after-free, leading to a crash.
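Steps 6 to 10 can be condensed into a small decision model. This is a
simplified userspace sketch under stated assumptions, not the kernel logic:
struct demo_unmount and the demo_* functions are hypothetical, the check for
asynchronous submission only models the crc32c condition mentioned in step 9,
and the freed workqueue is represented by a boolean instead of real memory.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_unmount {
	bool open_transaction;	/* step 6: left behind by the cleaner */
	bool csum_impl_fast;	/* stand-in for BTRFS_FS_CSUM_IMPL_FAST */
	bool workers_alive;	/* false once the worker queues are torn down */
};

/* Step 9, simplified: async submission is chosen only without fast csum. */
static bool demo_async_write(const struct demo_unmount *u)
{
	return !u->csum_impl_fast;
}

/* Returns true if the final writeback would queue work on a dead queue. */
static bool demo_unmount_hits_stale_queue(struct demo_unmount *u)
{
	u->workers_alive = false;	/* workers are stopped first */

	/* Step 8: an open transaction means dirty btree inode pages. */
	bool dirty_btree_pages = u->open_transaction;

	if (!dirty_btree_pages)
		return false;	/* nothing to write back at the final iput */
	if (!demo_async_write(u))
		return false;	/* synchronous path, workqueue not used */
	return !u->workers_alive;	/* step 10: work queued after teardown */
}
```

Only the combination of a leaked transaction and no fast checksum
implementation reaches the stale queue in this model, which matches the
statement that the crash needs crc32c hardware acceleration to be absent.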
When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:
------------[ cut here ]------------
WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
Code: f0 01 00 00 48 39 c2 75 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 48 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c6 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Code: 48 83 bb b0 03 00 00 00 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c7 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Code: ad de 49 be 22 01 00 (...)
RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c8 ]---
BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:
stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
Code: 54 55 53 48 89 f3 (...)
RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
submit_one_bio+0x61/0x70 [btrfs]
btree_write_cache_pages+0x414/0x450 [btrfs]
? kobject_put+0x9a/0x1d0
? trace_hardirqs_on+0x1b/0xf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
? free_debug_processing+0x1e1/0x2b0
do_writepages+0x43/0xe0
? lock_acquired+0x199/0x490
__writeback_single_inode+0x59/0x650
writeback_single_inode+0xaf/0x120
write_inode_now+0x94/0xd0
iput+0x187/0x2b0
close_ctree+0x2c6/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3cfebabee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
---[ end trace dd74718fef1ed5cc ]---
Finally, when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, a consequence of the transaction that was never committed and got
leaked:
=============================================================================
BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x0000000050cbdd61 @offset=12104
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x0000000086e9b0ff @offset=12776
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
commit_cowonly_roots+0x248/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000001a340018 @offset=4408
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_commit_transaction+0x60/0xc40 [btrfs]
create_subvol+0x56a/0x990 [btrfs]
btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
btrfs_ioctl+0x1a92/0x36f0 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x000000002b46292a @offset=13648
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? __mutex_unlock_slowpath+0x45/0x2a0
kmem_cache_destroy+0x55/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000004cf95ea8 @offset=6264
INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1
So fix this by making the remount path wait for the cleaner task before
calling btrfs_commit_super(). The remount path now waits for the bit
BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling
btrfs_commit_super(), which ensures the cleaner cannot start a
transaction after that, because it sleeps when the filesystem is in RO
mode and we have already flagged the filesystem as RO before waiting for
BTRFS_FS_CLEANER_RUNNING to be cleared.
This also introduces a new flag BTRFS_FS_STATE_RO to be used for
fs_info->fs_state when the filesystem is in RO mode. This is because we
were doing the RO check using the flags of the superblock and setting the
RO mode simply by ORing into the superblock's flags - those operations are
not atomic and could result in the cleaner not seeing the update from the
remount task after it clears BTRFS_FS_CLEANER_RUNNING.
Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
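The ordering described in the fix above can be sketched in user space with C11 atomics. This is only an illustration under stated assumptions, not the kernel code: `FS_STATE_RO`, `FS_CLEANER_RUNNING`, `struct fs_state` and the helper functions are hypothetical stand-ins for the fs_info bits, and a busy-wait replaces the kernel's sleeping wait.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers standing in for the kernel flags. */
enum { FS_STATE_RO = 0, FS_CLEANER_RUNNING = 1 };

struct fs_state {
	atomic_ulong bits;
};

static bool fs_test_bit(struct fs_state *fs, int bit)
{
	return (atomic_load(&fs->bits) & (1UL << bit)) != 0;
}

static void fs_set_bit(struct fs_state *fs, int bit)
{
	atomic_fetch_or(&fs->bits, 1UL << bit);
}

static void fs_clear_bit(struct fs_state *fs, int bit)
{
	atomic_fetch_and(&fs->bits, ~(1UL << bit));
}

/* Cleaner side: returns true if it did a pass, false if it must sleep. */
static bool cleaner_try_run(struct fs_state *fs)
{
	fs_set_bit(fs, FS_CLEANER_RUNNING);
	if (fs_test_bit(fs, FS_STATE_RO)) {
		/* Filesystem is RO: back off without starting a transaction. */
		fs_clear_bit(fs, FS_CLEANER_RUNNING);
		return false;
	}
	/* ... work that may start a transaction would go here ... */
	fs_clear_bit(fs, FS_CLEANER_RUNNING);
	return true;
}

/* Remount side: flag RO first, then wait for the cleaner to go idle. */
static void remount_ro_prepare(struct fs_state *fs)
{
	fs_set_bit(fs, FS_STATE_RO);
	while (fs_test_bit(fs, FS_CLEANER_RUNNING))
		; /* the kernel would sleep here instead of spinning */
	/* From here on the super block can be committed safely. */
}
```

Because both the RO flag and the running bit are updated with atomic read-modify-write operations, a cleaner that sets its running bit after the remount task has flagged RO is guaranteed to observe the RO flag and back off.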
2020-12-14 10:10:47 +00:00
2020-02-14 16:11:47 -05:00
	return btrfs_alloc_stripe_hash_table(fs_info);
2020-01-24 09:32:58 -05:00
}
2020-02-18 16:56:08 +02:00
static int btrfs_uuid_rescan_kthread(void *data)
{
2022-03-31 03:34:08 -07:00
	struct btrfs_fs_info *fs_info = data;
2020-02-18 16:56:08 +02:00
	int ret;
	/*
	 * 1st step is to iterate through the existing UUID tree and
	 * to delete all entries that contain outdated data.
	 * 2nd step is to add all missing entries to the UUID tree.
	 */
	ret = btrfs_uuid_tree_iterate(fs_info);
	if (ret < 0) {
2020-02-14 15:05:01 -05:00
		if (ret != -EINTR)
			btrfs_warn(fs_info, "iterating uuid_tree failed %d",
				   ret);
2020-02-18 16:56:08 +02:00
		up(&fs_info->uuid_tree_rescan_sem);
		return ret;
	}
	return btrfs_uuid_scan_kthread(data);
}
static int btrfs_check_uuid_tree(struct btrfs_fs_info *fs_info)
{
	struct task_struct *task;
	down(&fs_info->uuid_tree_rescan_sem);
	task = kthread_run(btrfs_uuid_rescan_kthread, fs_info, "btrfs-uuid");
	if (IS_ERR(task)) {
		/* fs_info->update_uuid_tree_gen remains 0 in all error cases */
		btrfs_warn(fs_info, "failed to start uuid_rescan task");
		up(&fs_info->uuid_tree_rescan_sem);
		return PTR_ERR(task);
	}
	return 0;
}
2023-07-26 16:57:07 +01:00
static int btrfs_cleanup_fs_roots(struct btrfs_fs_info *fs_info)
{
	u64 root_objectid = 0;
	struct btrfs_root *gang[8];
2024-03-19 20:25:09 +05:30
	int ret = 0;
2023-07-26 16:57:07 +01:00
	while (1) {
2024-03-19 20:25:09 +05:30
		unsigned int found;
2023-07-26 16:57:07 +01:00
		spin_lock(&fs_info->fs_roots_radix_lock);
2024-03-19 20:25:09 +05:30
		found = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
2023-07-26 16:57:07 +01:00
					       (void **)gang, root_objectid,
					       ARRAY_SIZE(gang));
2024-03-19 20:25:09 +05:30
		if (!found) {
2023-07-26 16:57:07 +01:00
			spin_unlock(&fs_info->fs_roots_radix_lock);
			break;
		}
2024-03-19 20:25:09 +05:30
		root_objectid = btrfs_root_id(gang[found - 1]) + 1;
2023-07-26 16:57:07 +01:00
2024-03-19 20:25:09 +05:30
		for (int i = 0; i < found; i++) {
2023-07-26 16:57:07 +01:00
			/* Avoid grabbing roots in dead_roots. */
			if (btrfs_root_refs(&gang[i]->root_item) == 0) {
				gang[i] = NULL;
				continue;
			}
			/* Grab all the search results for later use. */
			gang[i] = btrfs_grab_root(gang[i]);
		}
		spin_unlock(&fs_info->fs_roots_radix_lock);
2024-03-19 20:25:09 +05:30
		for (int i = 0; i < found; i++) {
2023-07-26 16:57:07 +01:00
			if (!gang[i])
				continue;
2024-04-15 16:16:23 -04:00
			root_objectid = btrfs_root_id(gang[i]);
2024-03-19 20:25:09 +05:30
			/*
			 * Continue to release the remaining roots after the first
			 * error without cleanup and preserve the first error
			 * for the return.
			 */
			if (!ret)
				ret = btrfs_orphan_cleanup(gang[i]);
2023-07-26 16:57:07 +01:00
			btrfs_put_root(gang[i]);
		}
2024-03-19 20:25:09 +05:30
		if (ret)
			break;
2023-07-26 16:57:07 +01:00
		root_objectid++;
	}
2024-03-19 20:25:09 +05:30
	return ret;
2023-07-26 16:57:07 +01:00
}
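The batched walk used by btrfs_cleanup_fs_roots() above can be illustrated outside the kernel. The following is a rough user-space sketch, assuming a sorted array in place of the radix tree; `struct id_set`, `gang_lookup()`, `walk_all()` and the two sample visitors are hypothetical names introduced for illustration. It shows the two properties of the loop: each batch resumes after the last id returned, and once a visit fails the remaining entries of the batch are still released while the first error is preserved.

```c
#include <stddef.h>

#define BATCH 8

/* Hypothetical stand-in for the radix tree: a sorted array of ids. */
struct id_set {
	const unsigned long *ids;
	size_t count;
};

/* Copy up to max ids that are >= first into out; return how many. */
static size_t gang_lookup(const struct id_set *set, unsigned long first,
			  unsigned long *out, size_t max)
{
	size_t found = 0;

	for (size_t i = 0; i < set->count && found < max; i++) {
		if (set->ids[i] >= first)
			out[found++] = set->ids[i];
	}
	return found;
}

/* Visit every id in batches and return the first error encountered. */
static int walk_all(const struct id_set *set,
		    int (*visit)(unsigned long id), size_t *visited)
{
	unsigned long next = 0;
	unsigned long batch[BATCH];
	int ret = 0;

	*visited = 0;
	while (!ret) {
		size_t found = gang_lookup(set, next, batch, BATCH);

		if (!found)
			break;
		/* Resume after the last id of this batch. */
		next = batch[found - 1] + 1;

		for (size_t i = 0; i < found; i++) {
			/* After the first error, skip the work but keep looping. */
			if (!ret) {
				ret = visit(batch[i]);
				(*visited)++;
			}
			/* A real caller would drop its reference here. */
		}
	}
	return ret;
}

/* Sample visitors used to exercise the walk. */
static int visit_ok(unsigned long id)
{
	(void)id;
	return 0;
}

static int fail_on_10(unsigned long id)
{
	return id == 10 ? -5 : 0;
}
```

With eleven ids and a batch size of 8, a successful walk takes two non-empty lookups plus a final empty one. A failure on id 10 stops the walk after the first batch.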
btrfs: lift read-write mount setup from mount and remount
Mounting rw and remounting from ro to rw naturally share invariants and
functionality which result in a correctly setup rw filesystem. Luckily,
there is even a strong unity in the code which implements them. In
mount's open_ctree, these operations mostly happen after an early return
for ro file systems, and in remount, they happen in a section devoted to
remounting ro->rw, after some remount specific validation passes.
However, there are unfortunately a few differences. There are small
deviations in the order of some of the operations, remount does not
start orphan cleanup in root_tree or fs_tree, remount does not create
the free space tree, and remount does not handle "one-shot" mount
options like clear_cache and uuid tree rescan.
Since we want to add building the free space tree to remount, and also
to start the same orphan cleanup process on a filesystem mounted as ro
then remounted rw, we would benefit from unifying the logic between the
two code paths.
This patch only lifts the existing common functionality, and leaves a
natural path for fixing the discrepancies.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-18 15:06:16 -08:00
/*
 * Mounting logic specific to read-write file systems. Shared by open_ctree
 * and btrfs_remount when remounting from read-only to read-write.
 */
int btrfs_start_pre_rw_mount(struct btrfs_fs_info *fs_info)
{
	int ret;
btrfs: keep sb cache_generation consistent with space_cache
When mounting, btrfs uses the cache_generation in the super block to
determine if space cache v1 is in use. However, by mounting with
nospace_cache or space_cache=v2, it is possible to disable space cache
v1, which does not result in un-setting cache_generation back to 0.
In order to base some logic, like mount option printing in /proc/mounts,
on the current state of the space cache rather than just the values of
the mount option, keep the value of cache_generation consistent with the
status of space cache v1.
We ensure that cache_generation > 0 iff the file system is using
space_cache v1. This requires committing a transaction on any mount
which changes whether we are using v1. (v1->nospace_cache, v1->v2,
nospace_cache->v1, v2->v1).
Since the mechanism for writing out the cache generation is transaction
commit, but we want some finer grained control over when we un-set it,
we can't just rely on the SPACE_CACHE mount option, and introduce an
fs_info flag that mount can use when it wants to unset the generation.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-18 15:06:22 -08:00
	const bool cache_opt = btrfs_test_opt(fs_info, SPACE_CACHE);
2023-04-28 14:13:05 +08:00
	bool rebuild_free_space_tree = false;
2020-11-18 15:06:21 -08:00
	if (btrfs_test_opt(fs_info, CLEAR_CACHE) &&
	    btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE)) {
2023-11-22 12:17:41 -05:00
		if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2))
			btrfs_warn(fs_info,
				   "'clear_cache' option is ignored with extent tree v2");
		else
			rebuild_free_space_tree = true;
2020-11-18 15:06:21 -08:00
	} else if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE) &&
		   !btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE_VALID)) {
		btrfs_warn(fs_info, "free space tree is invalid");
2023-04-28 14:13:05 +08:00
		rebuild_free_space_tree = true;
2020-11-18 15:06:21 -08:00
	}
2023-04-28 14:13:05 +08:00
	if (rebuild_free_space_tree) {
		btrfs_info(fs_info, "rebuilding free space tree");
		ret = btrfs_rebuild_free_space_tree(fs_info);
2020-11-18 15:06:21 -08:00
		if (ret) {
			btrfs_warn(fs_info,
2023-04-28 14:13:05 +08:00
				   "failed to rebuild free space tree: %d", ret);
			goto out;
		}
	}
	if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE) &&
	    !btrfs_test_opt(fs_info, FREE_SPACE_TREE)) {
		btrfs_info(fs_info, "disabling free space tree");
		ret = btrfs_delete_free_space_tree(fs_info);
		if (ret) {
			btrfs_warn(fs_info,
				   "failed to disable free space tree: %d", ret);
2020-11-18 15:06:21 -08:00
			goto out;
		}
	}
2020-11-18 15:06:16 -08:00
2021-03-16 16:53:46 +00:00
	/*
	 * btrfs_find_orphan_roots() is responsible for finding all the dead
	 * roots (with 0 refs), flagging them with BTRFS_ROOT_DEAD_TREE and loading
2022-07-15 13:59:21 +02:00
	 * them into the fs_info->fs_roots_radix tree. This must be done before
2021-03-16 16:53:46 +00:00
	 * calling btrfs_orphan_cleanup() on the tree root. If we don't do it
	 * first, then btrfs_orphan_cleanup() will delete a dead root's orphan
	 * item before the root's tree is deleted - this means that if we unmount
	 * or crash before the deletion completes, on the next mount we will not
	 * delete what remains of the tree because the orphan item does not
	 * exist anymore, which is what tells us we have a pending deletion.
	 */
	ret = btrfs_find_orphan_roots(fs_info);
	if (ret)
		goto out;
2020-11-18 15:06:16 -08:00
	ret = btrfs_cleanup_fs_roots(fs_info);
	if (ret)
		goto out;
2020-11-18 15:06:17 -08:00
	down_read(&fs_info->cleanup_work_sem);
	if ((ret = btrfs_orphan_cleanup(fs_info->fs_root)) ||
	    (ret = btrfs_orphan_cleanup(fs_info->tree_root))) {
		up_read(&fs_info->cleanup_work_sem);
		goto out;
	}
	up_read(&fs_info->cleanup_work_sem);
2020-11-18 15:06:16 -08:00
	mutex_lock(&fs_info->cleaner_mutex);
2022-02-18 14:56:12 -05:00
	ret = btrfs_recover_relocation(fs_info);
2020-11-18 15:06:16 -08:00
	mutex_unlock(&fs_info->cleaner_mutex);
	if (ret < 0) {
		btrfs_warn(fs_info, "failed to recover relocation: %d", ret);
		goto out;
	}
2020-11-18 15:06:19 -08:00
	if (btrfs_test_opt(fs_info, FREE_SPACE_TREE) &&
	    !btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE)) {
		btrfs_info(fs_info, "creating free space tree");
		ret = btrfs_create_free_space_tree(fs_info);
		if (ret) {
			btrfs_warn(fs_info,
				   "failed to create free space tree: %d", ret);
			goto out;
		}
	}
2020-11-18 15:06:22 -08:00
	if (cache_opt != btrfs_free_space_cache_v1_active(fs_info)) {
		ret = btrfs_set_free_space_cache_v1_active(fs_info, cache_opt);
		if (ret)
			goto out;
	}
2020-11-18 15:06:16 -08:00
	ret = btrfs_resume_balance_async(fs_info);
	if (ret)
		goto out;
	ret = btrfs_resume_dev_replace_async(fs_info);
	if (ret) {
		btrfs_warn(fs_info, "failed to resume dev_replace");
		goto out;
	}
	btrfs_qgroup_rescan_resume(fs_info);
	if (!fs_info->uuid_root) {
		btrfs_info(fs_info, "creating UUID tree");
		ret = btrfs_create_uuid_tree(fs_info);
		if (ret) {
			btrfs_warn(fs_info,
				   "failed to create the UUID tree %d", ret);
			goto out;
		}
	}
out:
	return ret;
}
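The free space tree handling in btrfs_start_pre_rw_mount() above reduces to a small decision table over the mount options and the on-disk feature bits. The sketch below restates that table as a pure function so it can be read and tested in isolation. `struct fst_inputs`, the `FST_*` action bits and `fst_plan()` are hypothetical names, and the plan is computed from the initial state only, whereas the kernel function performs each step and re-reads the flags as it goes.

```c
#include <stdbool.h>

/* Hypothetical action bits; more than one may be set. */
enum {
	FST_REBUILD = 1 << 0,
	FST_DELETE  = 1 << 1,
	FST_CREATE  = 1 << 2,
};

/* Hypothetical snapshot of the inputs the mount path looks at. */
struct fst_inputs {
	bool opt_clear_cache;	/* "clear_cache" mount option */
	bool opt_fst;		/* free space tree requested by options */
	bool has_fst;		/* FREE_SPACE_TREE compat_ro bit on disk */
	bool fst_valid;		/* FREE_SPACE_TREE_VALID compat_ro bit */
	bool extent_tree_v2;	/* EXTENT_TREE_V2 incompat bit */
};

static unsigned int fst_plan(const struct fst_inputs *in)
{
	unsigned int plan = 0;

	if (in->opt_clear_cache && in->has_fst) {
		/* clear_cache is ignored (with a warning) on extent tree v2. */
		if (!in->extent_tree_v2)
			plan |= FST_REBUILD;
	} else if (in->has_fst && !in->fst_valid) {
		/* An invalid tree is always rebuilt. */
		plan |= FST_REBUILD;
	}
	if (in->has_fst && !in->opt_fst)
		plan |= FST_DELETE;
	if (in->opt_fst && !in->has_fst)
		plan |= FST_CREATE;
	return plan;
}
```

Delete and create are mutually exclusive by construction, since one requires the tree to exist on disk and the other requires it to be absent.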
btrfs: relax block-group-tree feature dependency checks
[BUG]
When a user mistakenly tries to clear the block group tree, which cannot
be done through mount options, by using "-o clear_cache,space_cache=v2",
the following error occurs on a fs with the block-group-tree feature:
BTRFS info (device dm-1): force clearing of disk cache
BTRFS info (device dm-1): using free space tree
BTRFS info (device dm-1): clearing free space tree
BTRFS info (device dm-1): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1)
BTRFS info (device dm-1): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2)
BTRFS error (device dm-1): block-group-tree feature requires fres-space-tree and no-holes
BTRFS error (device dm-1): super block corruption detected before writing it to disk
BTRFS: error (device dm-1) in write_all_supers:4318: errno=-117 Filesystem corrupted (unexpected superblock corruption detected)
BTRFS warning (device dm-1: state E): Skipping commit of aborted transaction.
[CAUSE]
Although the dependency of the block-group-tree feature is just an
artificial one (to reduce the test matrix), we put the dependency check into
btrfs_validate_super().
This is too strict: during space cache clearing, there is a
window where the free space tree is cleared and we need to commit the super
block.
In that window, we have a block group tree without the v2 cache, which
triggers the artificial dependency check.
This is not necessary at all, especially for such a soft dependency.
[FIX]
Introduce a new helper, btrfs_check_features(), to do all the runtime
limitation checks, including:
- Unsupported incompat flags check
- Unsupported compat RO flags check
- Setting missing incompat flags
- Artificial feature dependency checks
Currently only block group tree will rely on this.
- Subpage runtime check for v1 cache
With this helper, we can move quite a few checks from
open_ctree()/btrfs_remount() into it, and just call it after
btrfs_parse_options().
Now "-o clear_cache,space_cache=v2" will not trigger the above error
anymore.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ edit messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
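The runtime checks listed under [FIX] could look roughly like the following user-space sketch. It is not the kernel's btrfs_check_features(): the struct, the macro names, the bit positions and the caller-supplied support masks are illustrative assumptions, and the function takes a plain read-write flag so that unknown compat_ro bits are only rejected for read-write mounts.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative feature bits; treat the exact values as assumptions. */
#define COMPAT_RO_FREE_SPACE_TREE	(1ULL << 0)
#define COMPAT_RO_FREE_SPACE_TREE_VALID	(1ULL << 1)
#define COMPAT_RO_BLOCK_GROUP_TREE	(1ULL << 3)
#define INCOMPAT_NO_HOLES		(1ULL << 9)

/* Hypothetical snapshot of the super block feature words. */
struct sb_features {
	uint64_t incompat;
	uint64_t compat_ro;
};

static int check_features(const struct sb_features *sb,
			  uint64_t supported_incompat,
			  uint64_t supported_compat_ro,
			  bool is_rw_mount)
{
	/* Unknown incompat bits always prevent mounting. */
	if (sb->incompat & ~supported_incompat)
		return -EINVAL;

	/* Unknown compat_ro bits only prevent read-write mounts. */
	if ((sb->compat_ro & ~supported_compat_ro) && is_rw_mount)
		return -EINVAL;

	/* Artificial dependency: block group tree needs v2 cache and no-holes. */
	if ((sb->compat_ro & COMPAT_RO_BLOCK_GROUP_TREE) &&
	    (!(sb->compat_ro & COMPAT_RO_FREE_SPACE_TREE_VALID) ||
	     !(sb->incompat & INCOMPAT_NO_HOLES)))
		return -EINVAL;

	return 0;
}
```

Running such a check once, right after option parsing, rather than on every super block validation means the transient state during free space tree clearing is never tested against the artificial dependency.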
2022-09-12 13:44:37 +08:00
/*
 * Do various sanity and dependency checks of different features.
 *
btrfs: fix compat_ro checks against remount
[BUG]
Even with commit 81d5d61454c3 ("btrfs: enhance unsupported compat RO
flags handling"), btrfs can still mount a fs with unsupported compat_ro
flags read-only, then remount it RW:
# btrfs ins dump-super /dev/loop0 | grep compat_ro_flags -A 3
compat_ro_flags 0x403
( FREE_SPACE_TREE |
FREE_SPACE_TREE_VALID |
unknown flag: 0x400 )
# mount /dev/loop0 /mnt/btrfs
mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
dmesg(1) may have more information after failed mount system call.
^^^ RW mount failed as expected ^^^
# dmesg -t | tail -n5
loop0: detected capacity change from 0 to 1048576
BTRFS: device fsid cb5b82f5-0fdd-4d81-9b4b-78533c324afa devid 1 transid 7 /dev/loop0 scanned by mount (1146)
BTRFS info (device loop0): using crc32c (crc32c-intel) checksum algorithm
BTRFS info (device loop0): using free space tree
BTRFS error (device loop0): cannot mount read-write because of unknown compat_ro features (0x403)
BTRFS error (device loop0): open_ctree failed
# mount /dev/loop0 -o ro /mnt/btrfs
# mount -o remount,rw /mnt/btrfs
^^^ RW remount succeeded unexpectedly ^^^
[CAUSE]
Currently we use btrfs_check_features() to check compat_ro flags against
our current mount flags.
That function get reused between open_ctree() and btrfs_remount().
But for btrfs_remount(), the super block we passed in still has the old
mount flags, thus btrfs_check_features() still believes we're mounting
read-only.
[FIX]
Replace the existing @sb argument with @is_rw_mount, as originally we
only used @sb to determine if the mount is RW.
Now it's the callers' responsibility to determine if the mount is RW, and
since there are only two callers, the check is pretty simple:
- caller in open_ctree()
Just pass !sb_rdonly().
- caller in btrfs_remount()
Pass !(*flags & SB_RDONLY), as our check should be against the new
flags.
Now we can correctly reject the RW remount:
# mount /dev/loop0 -o ro /mnt/btrfs
# mount -o remount,rw /mnt/btrfs
mount: /mnt/btrfs: mount point not mounted or bad option.
dmesg(1) may have more information after failed mount system call.
# dmesg -t | tail -n 1
BTRFS error (device loop0: state M): cannot mount read-write because of unknown compat_ro features (0x403)
Reported-by: Chung-Chiang Cheng <shepjeng@gmail.com>
Fixes: 81d5d61454c3 ("btrfs: enhance unsupported compat RO flags handling")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
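The cause described above can be illustrated with a minimal user-space sketch.
Everything prefixed with toy_/TOY_ is a hypothetical stand-in invented for
this illustration, not a kernel definition; only the idea of testing the
requested new flags instead of the still read-only super block is taken from
the commit message.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for this sketch only; not the kernel definitions. */
#define TOY_SB_RDONLY		0x1u
#define TOY_COMPAT_RO_SUPP	0x3ull	/* FREE_SPACE_TREE | FREE_SPACE_TREE_VALID */

struct toy_sb {
	unsigned int s_flags;	/* flags of the currently mounted fs */
};

/* Return 0 if the mount may proceed, -EINVAL if it must be rejected. */
static int toy_check_compat_ro(uint64_t compat_ro, bool is_rw_mount)
{
	const uint64_t unsupp = compat_ro & ~TOY_COMPAT_RO_SUPP;

	if (unsupp && is_rw_mount)
		return -EINVAL;
	return 0;
}

/* Old behaviour: RW-ness is derived from the old, still read-only flags. */
static int toy_remount_old(const struct toy_sb *sb, unsigned int new_flags,
			   uint64_t compat_ro)
{
	(void)new_flags;
	return toy_check_compat_ro(compat_ro, !(sb->s_flags & TOY_SB_RDONLY));
}

/* Fixed behaviour: RW-ness is derived from the requested new flags. */
static int toy_remount_fixed(const struct toy_sb *sb, unsigned int new_flags,
			     uint64_t compat_ro)
{
	(void)sb;
	return toy_check_compat_ro(compat_ro, !(new_flags & TOY_SB_RDONLY));
}
```

With a read-only toy_sb and compat_ro 0x403, toy_remount_old() accepts a
remount to RW while toy_remount_fixed() rejects it, which mirrors the
before/after transcripts above.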
2022-12-22 07:59:17 +08:00
 * @is_rw_mount:	If the mount is read-write.
*
btrfs: relax block-group-tree feature dependency checks
[BUG]
When a user mistakenly tries to clear the block group tree (which cannot
be done through a mount option) by using "-o clear_cache,space_cache=v2",
the following error is triggered on a fs with the block-group-tree feature:
BTRFS info (device dm-1): force clearing of disk cache
BTRFS info (device dm-1): using free space tree
BTRFS info (device dm-1): clearing free space tree
BTRFS info (device dm-1): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1)
BTRFS info (device dm-1): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2)
BTRFS error (device dm-1): block-group-tree feature requires fres-space-tree and no-holes
BTRFS error (device dm-1): super block corruption detected before writing it to disk
BTRFS: error (device dm-1) in write_all_supers:4318: errno=-117 Filesystem corrupted (unexpected superblock corruption detected)
BTRFS warning (device dm-1: state E): Skipping commit of aborted transaction.
[CAUSE]
Although the dependency for block-group-tree feature is just an
artificial one (to reduce test matrix), we put the dependency check into
btrfs_validate_super().
This is too strict, and during space cache clearing, we will have a
window where free space tree is cleared, and we need to commit the super
block.
In that window, we have block group tree without v2 cache, which triggers
the artificial dependency check.
This is not necessary at all, especially for such a soft dependency.
[FIX]
Introduce a new helper, btrfs_check_features(), to do all the runtime
limitation checks, including:
- Unsupported incompat flags check
- Unsupported compat RO flags check
- Setting missing incompat flags
- Artificial feature dependency checks
Currently only block group tree will rely on this.
- Subpage runtime check for v1 cache
With this helper, we can move quite some checks from
open_ctree()/btrfs_remount() into it, and just call it after
btrfs_parse_options().
Now "-o clear_cache,space_cache=v2" will not trigger the above error
anymore.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ edit messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
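The split between a strict validator and a relaxed mount-time check can be
sketched in user space as follows. The TOY_FEAT_* bits and both function
names are hypothetical and chosen only for this illustration; the real
helper tests a mount option for the free-space-tree half of the dependency,
which this sketch simplifies to a single bitmask.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical feature bits for this sketch; not the on-disk values. */
#define TOY_FEAT_FREE_SPACE_TREE	(1u << 0)
#define TOY_FEAT_NO_HOLES		(1u << 1)
#define TOY_FEAT_BLOCK_GROUP_TREE	(1u << 2)

/*
 * Strict check run before every super block write. It must only reject
 * states that indicate real corruption, so the soft dependency is not
 * enforced here.
 */
static int toy_validate_super(uint32_t features)
{
	(void)features;
	return 0;
}

/*
 * Relaxed check run once at mount/remount time, after the mount options
 * are parsed. This is where the artificial dependency is enforced.
 */
static int toy_check_features(uint32_t features)
{
	const uint32_t deps = TOY_FEAT_FREE_SPACE_TREE | TOY_FEAT_NO_HOLES;

	if ((features & TOY_FEAT_BLOCK_GROUP_TREE) &&
	    (features & deps) != deps)
		return -EINVAL;
	return 0;
}
```

In the transient window (block group tree and no-holes set, free space tree
cleared) toy_validate_super() still lets the super block be committed, while
a fresh mount in that state would be refused by toy_check_features().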
2022-09-12 13:44:37 +08:00
 * This is the place for less strict checks (like for subpage or artificial
 * feature dependencies).
 *
 * For strict checks or possible corruption detection, see
 * btrfs_validate_super().
 *
 * This should be called after btrfs_parse_options(), as some mount options
 * (space cache related) can modify on-disk format like free space tree and
 * screw up certain feature dependencies.
*/
2022-12-22 07:59:17 +08:00
int btrfs_check_features(struct btrfs_fs_info *fs_info, bool is_rw_mount)
2022-09-12 13:44:37 +08:00
{
	struct btrfs_super_block *disk_super = fs_info->super_copy;
	u64 incompat = btrfs_super_incompat_flags(disk_super);
	const u64 compat_ro = btrfs_super_compat_ro_flags(disk_super);
	const u64 compat_ro_unsupp = (compat_ro & ~BTRFS_FEATURE_COMPAT_RO_SUPP);
	if (incompat & ~BTRFS_FEATURE_INCOMPAT_SUPP) {
		btrfs_err(fs_info,
			  "cannot mount because of unknown incompat features (0x%llx)",
			  incompat);
		return -EINVAL;
	}
	/* Runtime limitation for mixed block groups. */
	if ((incompat & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS) &&
	    (fs_info->sectorsize != fs_info->nodesize)) {
		btrfs_err(fs_info,
			  "unequal nodesize/sectorsize (%u != %u) are not allowed for mixed block groups",
			  fs_info->nodesize, fs_info->sectorsize);
		return -EINVAL;
	}
	/* Mixed backref is an always-enabled feature. */
	incompat |= BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF;
	/* Set compression related flags just in case. */
	if (fs_info->compress_type == BTRFS_COMPRESS_LZO)
		incompat |= BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO;
	else if (fs_info->compress_type == BTRFS_COMPRESS_ZSTD)
		incompat |= BTRFS_FEATURE_INCOMPAT_COMPRESS_ZSTD;
	/*
	 * An ancient flag, which should really be marked deprecated.
	 * Such runtime limitation doesn't really need an incompat flag.
	 */
	if (btrfs_super_nodesize(disk_super) > PAGE_SIZE)
		incompat |= BTRFS_FEATURE_INCOMPAT_BIG_METADATA;
2022-12-22 07:59:17 +08:00
if ( compat_ro_unsupp & & is_rw_mount ) {
2022-09-12 13:44:37 +08:00
		btrfs_err(fs_info,
			  "cannot mount read-write because of unknown compat_ro features (0x%llx)",
			  compat_ro);
		return -EINVAL;
	}
	/*
	 * We have unsupported RO compat features, although RO mounted, we
	 * should not cause any metadata writes, including log replay.
	 * Or we could screw up whatever the new feature requires.
	 */
	if (compat_ro_unsupp && btrfs_super_log_root(disk_super) &&
	    !btrfs_test_opt(fs_info, NOLOGREPLAY)) {
		btrfs_err(fs_info,
			  "cannot replay dirty log with unsupported compat_ro features (0x%llx), try rescue=nologreplay",
			  compat_ro);
		return -EINVAL;
	}
	/*
	 * Artificial limitations for block group tree, to force
	 * block-group-tree to rely on no-holes and free-space-tree.
	 */
	if (btrfs_fs_compat_ro(fs_info, BLOCK_GROUP_TREE) &&
	    (!btrfs_fs_incompat(fs_info, NO_HOLES) ||
	     !btrfs_test_opt(fs_info, FREE_SPACE_TREE))) {
		btrfs_err(fs_info,
			  "block-group-tree feature requires no-holes and free-space-tree features");
		return -EINVAL;
	}
	/*
	 * Subpage runtime limitation on v1 cache.
	 *
	 * V1 space cache still has some hard coded PAGE_SIZE usage, while
	 * we're already defaulting to v2 cache, no need to bother v1 as it's
	 * going to be deprecated anyway.
	 */
	if (fs_info->sectorsize < PAGE_SIZE && btrfs_test_opt(fs_info, SPACE_CACHE)) {
		btrfs_warn(fs_info,
			   "v1 space cache is not supported for page size %lu with sectorsize %u",
			   PAGE_SIZE, fs_info->sectorsize);
		return -EINVAL;
	}
	/* This can be called by remount, we need to protect the super block. */
	spin_lock(&fs_info->super_lock);
	btrfs_set_super_incompat_flags(disk_super, incompat);
	spin_unlock(&fs_info->super_lock);
	return 0;
}
2020-01-24 09:32:58 -05:00
int __cold open_ctree ( struct super_block * sb , struct btrfs_fs_devices * fs_devices ,
2024-05-30 19:14:12 +02:00
const char * options )
2020-01-24 09:32:58 -05:00
{
	u32 sectorsize;
	u32 nodesize;
	u32 stripesize;
	u64 generation;
	u16 csum_type;
	struct btrfs_super_block *disk_super;
	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
	struct btrfs_root *tree_root;
	struct btrfs_root *chunk_root;
	int ret;
	int level;
2020-01-24 09:32:59 -05:00
ret = init_mount_fs_info ( fs_info , sb ) ;
2023-02-28 08:44:30 +08:00
if ( ret )
2020-01-24 09:32:58 -05:00
goto fail ;
2013-01-29 18:40:14 -05:00
2020-01-24 09:32:58 -05:00
	/* These need to be init'ed before we start creating inodes and such. */
	tree_root = btrfs_alloc_root(fs_info, BTRFS_ROOT_TREE_OBJECTID,
				     GFP_KERNEL);
	fs_info->tree_root = tree_root;
	chunk_root = btrfs_alloc_root(fs_info, BTRFS_CHUNK_TREE_OBJECTID,
				      GFP_KERNEL);
	fs_info->chunk_root = chunk_root;
	if (!tree_root || !chunk_root) {
2023-02-28 08:44:30 +08:00
ret = - ENOMEM ;
2020-02-14 16:11:47 -05:00
goto fail ;
2020-01-24 09:32:58 -05:00
}
2023-02-19 19:10:22 +01:00
ret = btrfs_init_btree_inode ( sb ) ;
2023-02-28 08:44:30 +08:00
if ( ret )
2020-02-14 16:11:47 -05:00
goto fail ;
2020-01-24 09:32:58 -05:00
2021-08-24 13:05:19 +08:00
invalidate_bdev ( fs_devices - > latest_dev - > bdev ) ;
2013-03-06 15:57:46 +01:00
/*
* Read super block and check the signature bytes only
*/
2021-08-24 13:05:19 +08:00
disk_super = btrfs_read_dev_super ( fs_devices - > latest_dev - > bdev ) ;
2020-02-14 00:24:32 +09:00
if ( IS_ERR ( disk_super ) ) {
2023-02-28 08:44:30 +08:00
ret = PTR_ERR ( disk_super ) ;
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so we made this patch, which implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, which uses two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the number of delayed items is beyond the lower limit, we create works
for some delayed nodes and insert them into the work queue of the worker, and
then go back.
When the number of delayed items is beyond the upper bound, we create works
for all the delayed nodes that haven't been dealt with, insert them into the
work queue of the worker, and then wait until the number of untreated items is
below some threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we first
search for it in the inserting rb-tree. If we find it, we just drop it. If not,
we add its key into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same as for the inserting manipulation)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit a transaction, we will deal with all the delayed nodes.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
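The quoted ~15% and ~20% figures follow from the total times listed above.
A small sketch of that arithmetic, assuming the improvement is measured as
the relative reduction in total time (the helper name percent_faster() is
hypothetical and only used here):

```c
#include <assert.h>

/* Relative reduction in total time, as a percentage of the "before" time. */
static double percent_faster(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

For file creation, percent_faster(1.096108, 0.932899) is about 14.9, and for
file deletion, percent_faster(1.510403, 1.215732) is about 19.5, matching the
rounded numbers in the message.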
2011-04-22 18:12:22 +08:00
goto fail_alloc ;
2011-01-08 10:09:13 +00:00
}
2007-06-12 06:35:45 -04:00
2023-11-02 07:54:50 +10:30
btrfs_info ( fs_info , " first mount of filesystem %pU " , disk_super - > fsid ) ;
2019-06-03 16:58:54 +02:00
/*
2020-08-04 19:48:34 -07:00
* Verify the type first , if that or the checksum value are
2019-06-03 16:58:54 +02:00
* corrupted , we ' ll find out
*/
2020-02-14 00:24:32 +09:00
csum_type = btrfs_super_csum_type ( disk_super ) ;
2019-06-03 16:58:55 +02:00
if ( ! btrfs_supported_super_csum ( csum_type ) ) {
2019-06-03 16:58:54 +02:00
btrfs_err ( fs_info , " unsupported checksum algorithm: %u " ,
2019-06-03 16:58:55 +02:00
csum_type ) ;
2023-02-28 08:44:30 +08:00
ret = - EINVAL ;
2020-02-14 00:24:32 +09:00
btrfs_release_disk_super ( disk_super ) ;
2019-06-03 16:58:54 +02:00
goto fail_alloc ;
}
2021-02-11 16:38:28 +08:00
fs_info - > csum_size = btrfs_super_csum_size ( disk_super ) ;
2019-06-03 16:58:56 +02:00
ret = btrfs_init_csum_hash ( fs_info , csum_type ) ;
if ( ret ) {
2020-02-14 00:24:32 +09:00
btrfs_release_disk_super ( disk_super ) ;
2019-06-03 16:58:56 +02:00
goto fail_alloc ;
}
2013-03-06 15:57:46 +01:00
/*
* We want to check superblock checksum , the type is stored inside .
* Pass the whole disk block of size BTRFS_SUPER_INFO_SIZE ( 4 k ) .
*/
2022-10-18 09:56:38 +08:00
if ( btrfs_check_super_csum ( fs_info , disk_super ) ) {
2016-05-09 11:32:39 +02:00
btrfs_err ( fs_info , " superblock checksum mismatch " ) ;
2023-02-28 08:44:30 +08:00
ret = - EINVAL ;
2020-02-14 00:24:32 +09:00
btrfs_release_disk_super ( disk_super ) ;
2020-01-24 09:32:57 -05:00
goto fail_alloc ;
2013-03-06 15:57:46 +01:00
}
/*
* super_copy is zeroed at allocation time and we never touch the
* following bytes up to INFO_SIZE , the checksum is calculated from
* the whole block of INFO_SIZE
*/
2020-02-14 00:24:32 +09:00
memcpy ( fs_info - > super_copy , disk_super , sizeof ( * fs_info - > super_copy ) ) ;
btrfs_release_disk_super ( disk_super ) ;
2007-10-15 16:14:19 -04:00
2018-10-30 16:43:25 +02:00
disk_super = fs_info - > super_copy ;
memcpy ( fs_info - > super_for_commit , fs_info - > super_copy ,
sizeof ( * fs_info - > super_for_commit ) ) ;
2018-10-30 16:43:24 +02:00
2018-05-11 13:35:26 +08:00
ret = btrfs_validate_mount_super ( fs_info ) ;
2013-03-06 15:57:46 +01:00
if ( ret ) {
2016-05-09 11:32:39 +02:00
btrfs_err ( fs_info , " superblock contains fatal errors " ) ;
2023-02-28 08:44:30 +08:00
ret = - EINVAL ;
2020-01-24 09:32:57 -05:00
goto fail_alloc ;
2013-03-06 15:57:46 +01:00
}
2023-02-28 08:44:30 +08:00
if ( ! btrfs_super_root ( disk_super ) ) {
btrfs_err ( fs_info , " invalid superblock tree root bytenr " ) ;
ret = - EINVAL ;
2020-01-24 09:32:57 -05:00
goto fail_alloc ;
2023-02-28 08:44:30 +08:00
}
2007-04-09 10:42:37 -04:00
2011-01-06 19:30:25 +08:00
/* check FS state, whether FS is broken. */
2013-01-29 10:14:48 +00:00
if ( btrfs_super_flags ( disk_super ) & BTRFS_SUPER_FLAG_ERROR )
2023-07-26 16:57:04 +01:00
WRITE_ONCE ( fs_info - > fs_error , - EUCLEAN ) ;
2011-01-06 19:30:25 +08:00
2021-08-10 23:23:44 +08:00
	/* Set up fs_info before parsing mount options */
	nodesize = btrfs_super_nodesize(disk_super);
	sectorsize = btrfs_super_sectorsize(disk_super);
	stripesize = sectorsize;
	fs_info->dirty_metadata_batch = nodesize * (1 + ilog2(nr_cpu_ids));
	fs_info->delalloc_batch = sectorsize * 512 * (1 + ilog2(nr_cpu_ids));
	fs_info->nodesize = nodesize;
	fs_info->sectorsize = sectorsize;
	fs_info->sectorsize_bits = ilog2(sectorsize);
	fs_info->csums_per_leaf = BTRFS_MAX_ITEM_SIZE(fs_info) / fs_info->csum_size;
	fs_info->stripesize = stripesize;
2023-11-22 12:17:40 -05:00
	/*
	 * Handle the space caching options appropriately now that we have the
	 * super block loaded and validated.
	 */
	btrfs_set_free_space_cache_settings(fs_info);
2023-11-22 12:17:50 -05:00
if ( ! btrfs_check_options ( fs_info , & fs_info - > mount_opt , sb - > s_flags ) ) {
ret = - EINVAL ;
2020-01-24 09:32:57 -05:00
goto fail_alloc ;
2023-11-22 12:17:50 -05:00
}
2008-05-13 13:46:40 -04:00
2022-12-22 07:59:17 +08:00
ret = btrfs_check_features ( fs_info , ! sb_rdonly ( sb ) ) ;
2023-02-28 08:44:30 +08:00
if ( ret < 0 )
btrfs: reject log replay if there is unsupported RO compat flag
[BUG]
If we have a btrfs image with dirty log, along with an unsupported RO
compatible flag:
log_root 30474240
...
compat_flags 0x0
compat_ro_flags 0x40000003
( FREE_SPACE_TREE |
FREE_SPACE_TREE_VALID |
unknown flag: 0x40000000 )
Then even if we can only mount it RO, we will still cause metadata
update for log replay:
BTRFS info (device dm-1): flagging fs with big metadata feature
BTRFS info (device dm-1): using free space tree
BTRFS info (device dm-1): has skinny extents
BTRFS info (device dm-1): start tree-log replay
This is definitely against the RO compat flag requirement.
[CAUSE]
An RO compat flag only forces us to do an RO mount, but we will still do
log replay for a plain RO mount.
Thus we end up doing log replay and updating metadata.
This can be very problematic for a new RO compat flag. For example, an
older kernel can not understand v2 cache, and if we allow metadata updates
on an RO mount, we can invalidate/corrupt the v2 cache.
[FIX]
Just reject the mount unless rescue=nologreplay is provided:
BTRFS error (device dm-1): cannot replay dirty log with unsupport optional features (0x40000000), try rescue=nologreplay instead
We don't want to set rescue=nologreplay directly, as this would make the
end user read the old data, and cause confusion.
Since such a case is really rare, we're mostly fine to just reject the
mount with an error message, which also includes the proper workaround.
CC: stable@vger.kernel.org #4.9+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
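The rejection rule from the fix can be sketched in user space as follows.
The mask LR_COMPAT_RO_SUPP and the function lr_check_log_replay() are
hypothetical names for this illustration; the sketch assumes the two
free space tree bits (0x3) are the only supported RO compat flags, as in
the dump-super output above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed set of supported RO compat bits, for this sketch only. */
#define LR_COMPAT_RO_SUPP	0x3ull

/*
 * Reject the mount when all three conditions hold:
 *  - some RO compat bit is not understood,
 *  - the super block points to a dirty log (log_root != 0),
 *  - the user did not ask to skip log replay.
 */
static int lr_check_log_replay(uint64_t compat_ro, uint64_t log_root,
			       bool nologreplay)
{
	const uint64_t unsupp = compat_ro & ~LR_COMPAT_RO_SUPP;

	if (unsupp && log_root && !nologreplay)
		return -EINVAL;
	return 0;
}
```

With the values from the example image (compat_ro 0x40000003, log_root
30474240) the call fails unless nologreplay is true, and a clean log
(log_root 0) is always accepted.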
2022-06-07 19:48:24 +08:00
goto fail_alloc ;
2023-11-22 12:17:50 -05:00
	/*
	 * At this point our mount options are validated, if we set ->max_inline
	 * to something non-standard make sure we truncate it to sectorsize.
	 */
	fs_info->max_inline = min_t(u64, fs_info->max_inline, fs_info->sectorsize);
2021-08-17 17:38:51 +08:00
if ( sectorsize < PAGE_SIZE ) {
struct btrfs_subpage_info * subpage_info ;
2021-07-26 14:35:06 +08:00
btrfs_warn ( fs_info ,
" read-write for sector size %u with page size %lu is experimental " ,
sectorsize , PAGE_SIZE ) ;
2021-08-17 17:38:51 +08:00
subpage_info = kzalloc ( sizeof ( * subpage_info ) , GFP_KERNEL ) ;
2023-02-28 08:44:30 +08:00
if ( ! subpage_info ) {
ret = - ENOMEM ;
2021-08-17 17:38:51 +08:00
goto fail_alloc ;
2023-02-28 08:44:30 +08:00
}
2021-08-17 17:38:51 +08:00
btrfs_init_subpage_info ( subpage_info , sectorsize ) ;
fs_info - > subpage_info = subpage_info ;
2021-07-26 14:35:01 +08:00
}
2021-01-26 16:34:02 +08:00
2021-11-10 14:42:17 +08:00
ret = btrfs_init_workqueues ( fs_info ) ;
2023-02-28 08:44:30 +08:00
if ( ret )
2011-11-18 14:37:27 -05:00
goto fail_sb_buffer ;
2008-06-11 21:47:56 -04:00
2017-04-12 12:24:32 +02:00
sb - > s_bdi - > ra_pages * = btrfs_super_num_devices ( disk_super ) ;
sb - > s_bdi - > ra_pages = max ( sb - > s_bdi - > ra_pages , SZ_4M / PAGE_SIZE ) ;
2008-04-18 16:13:31 -04:00
2024-01-16 17:33:20 +01:00
/* Update the values for the current filesystem. */
2008-05-07 11:43:44 -04:00
sb - > s_blocksize = sectorsize ;
sb - > s_blocksize_bits = blksize_bits ( sectorsize ) ;
2018-10-30 16:43:24 +02:00
memcpy ( & sb - > s_uuid , fs_info - > fs_devices - > fsid , BTRFS_FSID_SIZE ) ;
2007-10-15 16:15:53 -04:00
2008-06-25 16:01:30 -04:00
mutex_lock ( & fs_info - > chunk_mutex ) ;
2016-06-21 21:16:51 -04:00
ret = btrfs_read_sys_array ( fs_info ) ;
2008-06-25 16:01:30 -04:00
mutex_unlock ( & fs_info - > chunk_mutex ) ;
2008-04-25 09:04:37 -04:00
if ( ret ) {
2016-05-09 11:32:39 +02:00
btrfs_err ( fs_info , " failed to read the system array: %d " , ret ) ;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
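The reference-count rule for COW described in the commit message above can be illustrated with a toy model. This is a sketch under assumptions, not the kernel's data structures: `sketch_tree_block` and `sketch_cow_refs` are hypothetical, and only the counting side effects are modelled.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_tree_block {
	unsigned int refs;		/* reference count of this block */
	unsigned int *child_refs;	/* counts of the extents it points to */
	size_t nr_children;
};

/* Apply the reference-count side effects of COWing @old. */
static void sketch_cow_refs(struct sketch_tree_block *old)
{
	size_t i;

	assert(old != NULL);
	if (old->refs == 0)
		return;		/* already freed, nothing to do */
	if (old->refs == 1) {
		/*
		 * Not shared: the old block is freed at once and the new
		 * block inherits its references, so children are untouched.
		 */
		old->refs = 0;
		return;
	}
	/* Shared: the new block takes one extra reference on every child. */
	for (i = 0; i < old->nr_children; i++)
		old->child_refs[i]++;
	old->refs--;
}
```

The point of the non-shared branch is that no lower-level counts change, which is what removes the need for the later dead-root walk.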
2009-06-10 10:45:14 -04:00
goto fail_sb_buffer ;
2008-04-25 09:04:37 -04:00
}
2008-03-24 15:01:56 -04:00
2008-10-29 14:49:05 -04:00
generation = btrfs_super_chunk_root_generation ( disk_super ) ;
2018-03-29 09:08:11 +08:00
level = btrfs_super_chunk_root_level ( disk_super ) ;
2021-12-15 15:40:06 -05:00
ret = load_super_root ( chunk_root , btrfs_super_chunk_root ( disk_super ) ,
generation , level ) ;
if ( ret ) {
2016-05-09 11:32:39 +02:00
btrfs_err ( fs_info , " failed to read chunk root " ) ;
2011-11-03 15:17:42 -04:00
goto fail_tree_roots ;
2009-07-22 16:52:13 -04:00
}
2008-03-24 15:01:56 -04:00
2008-04-15 15:41:47 -04:00
read_extent_buffer ( chunk_root - > node , fs_info - > chunk_tree_uuid ,
2019-03-20 13:17:13 +01:00
offsetof ( struct btrfs_header , chunk_tree_uuid ) ,
BTRFS_UUID_SIZE ) ;
2008-04-15 15:41:47 -04:00
2016-06-21 10:40:19 -04:00
ret = btrfs_read_chunk_tree ( fs_info ) ;
2008-11-17 21:11:30 -05:00
if ( ret ) {
2016-05-09 11:32:39 +02:00
btrfs_err ( fs_info , " failed to read chunk tree: %d " , ret ) ;
2011-11-03 15:17:42 -04:00
goto fail_tree_roots ;
2008-11-17 21:11:30 -05:00
}
2008-03-24 15:01:56 -04:00
2012-11-06 13:15:27 +01:00
/*
2020-11-06 16:06:33 +08:00
* At this point we know all the devices that make this filesystem ,
* including the seed devices but we don ' t know yet if the replace
* target is required . So free devices that are not part of this
2021-05-21 17:42:23 +02:00
* filesystem but skip the replace target device which is checked
2020-11-06 16:06:33 +08:00
* below in btrfs_init_dev_replace ( ) .
2012-11-06 13:15:27 +01:00
*/
2020-11-06 16:06:33 +08:00
btrfs_free_extra_devids ( fs_devices ) ;
2021-08-24 13:05:19 +08:00
if ( ! fs_devices - > latest_dev - > bdev ) {
2016-05-09 11:32:39 +02:00
btrfs_err ( fs_info , " failed to read devices " ) ;
2023-02-28 08:44:30 +08:00
ret = - EIO ;
2012-02-20 20:53:43 -05:00
goto fail_tree_roots ;
}
2019-10-15 18:42:20 +03:00
ret = init_tree_roots ( fs_info ) ;
2014-08-01 18:12:45 -05:00
if ( ret )
2019-10-15 18:42:20 +03:00
goto fail_tree_roots ;
2010-05-16 10:49:58 -04:00
2021-02-04 19:21:42 +09:00
/*
* Get zone type information of zoned block devices . This will also
* handle emulation of a zoned filesystem if a regular device has the
* zoned incompat feature flag set .
*/
ret = btrfs_get_dev_zone_info_all_devices ( fs_info ) ;
if ( ret ) {
btrfs_err ( fs_info ,
2023-02-28 08:44:30 +08:00
" zoned: failed to read device zone info: %d " , ret ) ;
2021-02-04 19:21:42 +09:00
goto fail_block_groups ;
}
2020-02-14 15:22:06 -05:00
/*
* If we have a uuid root and we ' re not being told to rescan we need to
* check the generation here so we can set the
* BTRFS_FS_UPDATE_UUID_TREE_GEN bit . Otherwise we could commit the
* transaction during a balance or the log replay without updating the
* uuid generation , and then if we crash we would rescan the uuid tree ,
* even though it was perfectly fine .
*/
if ( fs_info - > uuid_root & & ! btrfs_test_opt ( fs_info , RESCAN_UUID_TREE ) & &
fs_info - > generation = = btrfs_super_uuid_tree_generation ( disk_super ) )
set_bit ( BTRFS_FS_UPDATE_UUID_TREE_GEN , & fs_info - > flags ) ;
2018-08-01 10:37:19 +08:00
ret = btrfs_verify_dev_extents ( fs_info ) ;
if ( ret ) {
btrfs_err ( fs_info ,
" failed to verify dev extents against chunks: %d " ,
ret ) ;
goto fail_block_groups ;
}
2012-06-22 12:24:12 -06:00
ret = btrfs_recover_balance ( fs_info ) ;
if ( ret ) {
2016-05-09 11:32:39 +02:00
btrfs_err ( fs_info , " failed to recover balance: %d " , ret ) ;
2012-06-22 12:24:12 -06:00
goto fail_block_groups ;
}
2012-05-25 16:06:10 +02:00
ret = btrfs_init_dev_stats ( fs_info ) ;
if ( ret ) {
2016-05-09 11:32:39 +02:00
btrfs_err ( fs_info , " failed to init dev_stats: %d " , ret ) ;
2012-05-25 16:06:10 +02:00
goto fail_block_groups ;
}
2012-11-06 13:15:27 +01:00
ret = btrfs_init_dev_replace ( fs_info ) ;
if ( ret ) {
2016-05-09 11:32:39 +02:00
btrfs_err ( fs_info , " failed to init dev_replace: %d " , ret ) ;
2012-11-06 13:15:27 +01:00
goto fail_block_groups ;
}
2020-11-10 20:26:08 +09:00
ret = btrfs_check_zoned_mode ( fs_info ) ;
if ( ret ) {
btrfs_err ( fs_info , " failed to initialize zoned mode: %d " ,
ret ) ;
goto fail_block_groups ;
}
2019-11-21 17:33:32 +08:00
ret = btrfs_sysfs_add_fsid ( fs_devices ) ;
2015-03-10 06:38:38 +08:00
if ( ret ) {
2016-05-09 11:32:39 +02:00
btrfs_err ( fs_info , " failed to init sysfs fsid interface: %d " ,
ret ) ;
2015-03-10 06:38:38 +08:00
goto fail_block_groups ;
}
2015-08-14 18:32:46 +08:00
ret = btrfs_sysfs_add_mounted ( fs_info ) ;
2011-03-07 02:13:14 +00:00
if ( ret ) {
2016-05-09 11:32:39 +02:00
btrfs_err ( fs_info , " failed to init sysfs interface: %d " , ret ) ;
2015-03-10 06:38:38 +08:00
goto fail_fsdev_sysfs ;
2011-03-07 02:13:14 +00:00
}
ret = btrfs_init_space_info ( fs_info ) ;
if ( ret ) {
2016-05-09 11:32:39 +02:00
btrfs_err ( fs_info , " failed to initialize space info: %d " , ret ) ;
2014-01-22 11:15:51 +08:00
goto fail_sysfs ;
2011-03-07 02:13:14 +00:00
}
2016-06-21 10:40:19 -04:00
ret = btrfs_read_block_groups ( fs_info ) ;
2010-03-19 20:49:55 +00:00
if ( ret ) {
2016-05-09 11:32:39 +02:00
btrfs_err ( fs_info , " failed to read block groups: %d " , ret ) ;
2014-01-22 11:15:51 +08:00
goto fail_sysfs ;
2010-03-19 20:49:55 +00:00
}
2017-03-09 09:34:37 +08:00
2021-11-11 14:14:38 +09:00
btrfs_free_zone_cache ( fs_info ) ;
2023-08-08 01:12:36 +09:00
btrfs_check_active_zone_reservation ( fs_info ) ;
2021-10-19 18:43:38 +08:00
if ( ! sb_rdonly ( sb ) & & fs_info - > fs_devices - > missing_devices & &
! btrfs_check_rw_degradable ( fs_info , NULL ) ) {
2016-05-09 11:32:39 +02:00
btrfs_warn ( fs_info ,
2018-11-28 12:05:13 +01:00
" writable mount is not allowed due to too many missing devices " ) ;
2023-02-28 08:44:30 +08:00
ret = - EINVAL ;
2014-01-22 11:15:51 +08:00
goto fail_sysfs ;
2012-10-30 17:16:16 +00:00
}
2007-04-26 16:46:15 -04:00
2022-02-18 14:56:11 -05:00
fs_info - > cleaner_kthread = kthread_run ( cleaner_kthread , fs_info ,
2008-06-25 16:01:31 -04:00
" btrfs-cleaner " ) ;
2023-02-28 08:44:30 +08:00
if ( IS_ERR ( fs_info - > cleaner_kthread ) ) {
ret = PTR_ERR ( fs_info - > cleaner_kthread ) ;
2014-01-22 11:15:51 +08:00
goto fail_sysfs ;
2023-02-28 08:44:30 +08:00
}
2008-06-25 16:01:31 -04:00
fs_info - > transaction_kthread = kthread_run ( transaction_kthread ,
tree_root ,
" btrfs-transaction " ) ;
2023-02-28 08:44:30 +08:00
if ( IS_ERR ( fs_info - > transaction_kthread ) ) {
ret = PTR_ERR ( fs_info - > transaction_kthread ) ;
2008-06-25 16:01:31 -04:00
goto fail_cleaner ;
2023-02-28 08:44:30 +08:00
}
2008-06-25 16:01:31 -04:00
2011-09-13 15:23:30 +02:00
ret = btrfs_read_qgroup_config ( fs_info ) ;
if ( ret )
goto fail_trans_kthread ;
2011-11-09 13:44:05 +01:00
2017-09-29 15:43:50 -04:00
if ( btrfs_build_ref_tree ( fs_info ) )
btrfs_err ( fs_info , " couldn't build ref tree " ) ;
2016-01-19 10:23:03 +08:00
/* do not make disk changes in broken FS or nologreplay is given */
if ( btrfs_super_log_root ( disk_super ) ! = 0 & &
2016-06-22 18:54:23 -04:00
! btrfs_test_opt ( fs_info , NOLOGREPLAY ) ) {
2020-02-05 17:12:16 +01:00
btrfs_info ( fs_info , " start tree-log replay " ) ;
2014-08-01 18:12:46 -05:00
ret = btrfs_replay_log ( fs_info , fs_devices ) ;
2023-02-28 08:44:30 +08:00
if ( ret )
2014-04-23 19:33:35 +08:00
goto fail_qgroup ;
2008-09-05 16:13:11 -04:00
}
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keep the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a separate inode to hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
the same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
the same root key objectid, doing special handling for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
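The sharing idea in the commit message above (a block COWed through one reloc tree is reused when another reloc tree needs the same block) can be sketched as a lookup from old to new block address. Everything here is hypothetical: the fixed-size array, the `sketch_` names, and the bump allocator stand in for the kernel's real bookkeeping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_RELOCATED 16

struct sketch_reloc_cache {
	uint64_t old_bytenr[SKETCH_MAX_RELOCATED];
	uint64_t new_bytenr[SKETCH_MAX_RELOCATED];
	size_t nr;
	uint64_t next_free;	/* next unused destination address */
	uint64_t blocksize;
};

/* Return the relocated address of @old; COW only the first time. 0 = full. */
static uint64_t sketch_relocate_block(struct sketch_reloc_cache *c,
				      uint64_t old)
{
	size_t i;

	assert(c != NULL);
	for (i = 0; i < c->nr; i++)
		if (c->old_bytenr[i] == old)
			return c->new_bytenr[i];	/* reuse earlier COW */
	if (c->nr == SKETCH_MAX_RELOCATED)
		return 0;
	c->old_bytenr[c->nr] = old;
	c->new_bytenr[c->nr] = c->next_free;
	c->next_free += c->blocksize;
	return c->new_bytenr[c->nr++];
}
```

Two subvolumes that both reference the same old block get the same new address back, so the relocated block stays shared.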
2008-09-26 10:09:34 -04:00
2020-05-15 19:35:55 +02:00
fs_info - > fs_root = btrfs_get_fs_root ( fs_info , BTRFS_FS_TREE_OBJECTID , true ) ;
2010-05-29 09:44:10 +00:00
if ( IS_ERR ( fs_info - > fs_root ) ) {
2023-02-28 08:44:30 +08:00
ret = PTR_ERR ( fs_info - > fs_root ) ;
btrfs_warn ( fs_info , " failed to read fs tree: %d " , ret ) ;
2020-02-13 10:47:28 -05:00
fs_info - > fs_root = NULL ;
2011-09-13 15:23:30 +02:00
goto fail_qgroup ;
2010-05-29 09:44:10 +00:00
}
2009-06-10 09:51:32 -04:00
2017-07-17 08:45:34 +01:00
if ( sb_rdonly ( sb ) )
2023-11-22 12:17:53 -05:00
return 0 ;
2012-01-16 22:04:48 +02:00
btrfs: lift read-write mount setup from mount and remount
Mounting rw and remounting from ro to rw naturally share invariants and
functionality which result in a correctly set up rw filesystem. Luckily,
there is even a strong unity in the code which implements them. In
mount's open_ctree, these operations mostly happen after an early return
for ro file systems, and in remount, they happen in a section devoted to
remounting ro->rw, after some remount specific validation passes.
However, there are unfortunately a few differences. There are small
deviations in the order of some of the operations, remount does not
start orphan cleanup in root_tree or fs_tree, remount does not create
the free space tree, and remount does not handle "one-shot" mount
options like clear_cache and uuid tree rescan.
Since we want to add building the free space tree to remount, and also
to start the same orphan cleanup process on a filesystem mounted as ro
then remounted rw, we would benefit from unifying the logic between the
two code paths.
This patch only lifts the existing common functionality, and leaves a
natural path for fixing the discrepancies.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-18 15:06:16 -08:00
ret = btrfs_start_pre_rw_mount ( fs_info ) ;
2012-06-22 12:24:13 -06:00
if ( ret ) {
2016-06-21 21:16:51 -04:00
close_ctree ( fs_info ) ;
2012-06-22 12:24:13 -06:00
return ret ;
2010-01-26 14:30:53 +00:00
}
2019-12-13 16:22:14 -08:00
btrfs_discard_resume ( fs_info ) ;
Btrfs: fix qgroup rescan resume on mount
When called during mount, we cannot start the rescan worker thread until
open_ctree is done. This commit restructures the qgroup rescan internals to
enable a clean deferral of the rescan resume operation.
First of all, the struct qgroup_rescan is removed, saving us a malloc and
some initialization synchronization problems. Its only element (the worker
struct) now lives within fs_info just as the rest of the rescan code.
Then setting up a rescan worker is split into several reusable stages.
Currently we have three different rescan startup scenarios:
(A) rescan ioctl
(B) rescan resume by mount
(C) rescan by quota enable
Each case needs its own combination of the four following steps:
(1) set the progress [A, C: zero; B: state of umount]
(2) commit the transaction [A]
(3) set the counters [A, C: zero; B: state of umount]
(4) start worker [A, B, C]
qgroup_rescan_init does step (1). There's no extra function added to commit
a transaction, we've got that already. qgroup_rescan_zero_tracking does
step (3). Step (4) is nothing more than a call to the generic
btrfs_queue_worker.
We also get rid of a double check for the rescan progress during
btrfs_qgroup_account_ref, which is no longer required due to having step 2
from the list above.
As a side effect, this commit prepares to move the rescan start code from
btrfs_run_qgroups (which is run during commit) to a less time critical
section.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
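The scenario/step matrix listed in the commit message above can be written down as a small table-like function. This is a sketch only: the enum, struct, and function names are hypothetical, and "false" for steps 1 and 3 is used to mean "keep the state saved at umount" rather than "skip the step".

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_rescan_scenario {
	SKETCH_RESCAN_IOCTL,		/* (A) rescan ioctl */
	SKETCH_RESCAN_RESUME_MOUNT,	/* (B) rescan resume by mount */
	SKETCH_RESCAN_QUOTA_ENABLE	/* (C) rescan by quota enable */
};

struct sketch_rescan_plan {
	bool zero_progress;	 /* step 1: false keeps the umount state */
	bool commit_transaction; /* step 2 */
	bool zero_counters;	 /* step 3: false keeps the umount state */
	bool start_worker;	 /* step 4 */
};

static struct sketch_rescan_plan
sketch_rescan_plan_for(enum sketch_rescan_scenario s)
{
	/* Every scenario starts the worker (step 4). */
	struct sketch_rescan_plan plan = { false, false, false, true };

	assert((int)s <= (int)SKETCH_RESCAN_QUOTA_ENABLE);
	switch (s) {
	case SKETCH_RESCAN_IOCTL:
		plan.zero_progress = true;
		plan.commit_transaction = true;	/* only (A) commits */
		plan.zero_counters = true;
		break;
	case SKETCH_RESCAN_QUOTA_ENABLE:
		plan.zero_progress = true;
		plan.zero_counters = true;
		break;
	case SKETCH_RESCAN_RESUME_MOUNT:
		break;	/* progress and counters come from the umount state */
	}
	return plan;
}
```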
2013-05-28 15:47:24 +00:00
btrfs: lift read-write mount setup from mount and remount
Mounting rw and remounting from ro to rw naturally share invariants and
functionality which result in a correctly set up rw filesystem. Luckily,
there is even a strong unity in the code which implements them. In
mount's open_ctree, these operations mostly happen after an early return
for ro file systems, and in remount, they happen in a section devoted to
remounting ro->rw, after some remount specific validation passes.
However, there are unfortunately a few differences. There are small
deviations in the order of some of the operations, remount does not
start orphan cleanup in root_tree or fs_tree, remount does not create
the free space tree, and remount does not handle "one-shot" mount
options like clear_cache and uuid tree rescan.
Since we want to add building the free space tree to remount, and also
to start the same orphan cleanup process on a filesystem mounted as ro
then remounted rw, we would benefit from unifying the logic between the
two code paths.
This patch only lifts the existing common functionality, and leaves a
natural path for fixing the discrepancies.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-18 15:06:16 -08:00
if ( fs_info - > uuid_root & &
( btrfs_test_opt ( fs_info , RESCAN_UUID_TREE ) | |
fs_info - > generation ! = btrfs_super_uuid_tree_generation ( disk_super ) ) ) {
2016-05-09 11:32:39 +02:00
btrfs_info ( fs_info , " checking UUID tree " ) ;
2013-08-15 17:11:23 +02:00
ret = btrfs_check_uuid_tree ( fs_info ) ;
if ( ret ) {
2016-05-09 11:32:39 +02:00
btrfs_warn ( fs_info ,
" failed to check the UUID tree: %d " , ret ) ;
2016-06-21 21:16:51 -04:00
close_ctree ( fs_info ) ;
2013-08-15 17:11:23 +02:00
return ret ;
}
2013-08-15 17:11:19 +02:00
}
btrfs: keep sb cache_generation consistent with space_cache
When mounting, btrfs uses the cache_generation in the super block to
determine if space cache v1 is in use. However, by mounting with
nospace_cache or space_cache=v2, it is possible to disable space cache
v1, which does not result in un-setting cache_generation back to 0.
In order to base some logic, like mount option printing in /proc/mounts,
on the current state of the space cache rather than just the values of
the mount option, keep the value of cache_generation consistent with the
status of space cache v1.
We ensure that cache_generation > 0 iff the file system is using
space_cache v1. This requires committing a transaction on any mount
which changes whether we are using v1. (v1->nospace_cache, v1->v2,
nospace_cache->v1, v2->v1).
Since the mechanism for writing out the cache generation is transaction
commit, but we want some finer grained control over when we un-set it,
we can't just rely on the SPACE_CACHE mount option, and introduce an
fs_info flag that mount can use when it wants to unset the generation.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
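The invariant stated in the commit message above (cache_generation > 0 iff space cache v1 is in use) implies a simple test for when a mount has to commit a transaction. The sketch below assumes hypothetical names throughout and only models that decision, not how the kernel actually schedules the commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum sketch_cache_mode {
	SKETCH_NOSPACE_CACHE,
	SKETCH_SPACE_CACHE_V1,
	SKETCH_SPACE_CACHE_V2
};

/* The invariant: the on-disk generation is non-zero exactly when v1 is used. */
static bool sketch_disk_uses_v1(uint64_t cache_generation)
{
	return cache_generation > 0;
}

/* A commit is needed whenever the mount changes whether v1 is in use. */
static bool sketch_mount_needs_commit(uint64_t cache_generation,
				      enum sketch_cache_mode requested)
{
	bool want_v1;

	assert((int)requested <= (int)SKETCH_SPACE_CACHE_V2);
	want_v1 = (requested == SKETCH_SPACE_CACHE_V1);
	return sketch_disk_uses_v1(cache_generation) != want_v1;
}
```

This covers the four transitions named in the message: v1->nospace_cache, v1->v2, nospace_cache->v1 and v2->v1 all flip the v1 state and therefore need a commit.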
2020-11-18 15:06:22 -08:00
2016-09-02 15:40:02 -04:00
set_bit ( BTRFS_FS_OPEN , & fs_info - > flags ) ;
2014-09-18 11:20:02 -04:00
2022-02-18 14:56:10 -05:00
/* Kick the cleaner thread so it'll start deleting snapshots. */
if ( test_bit ( BTRFS_FS_UNFINISHED_DROPS , & fs_info - > flags ) )
wake_up_process ( fs_info - > cleaner_kthread ) ;
2011-11-17 01:10:02 -05:00
return 0 ;
2007-06-12 06:35:45 -04:00
2011-09-13 15:23:30 +02:00
fail_qgroup :
btrfs_free_qgroup_config ( fs_info ) ;
2008-11-19 15:13:35 -05:00
fail_trans_kthread :
kthread_stop ( fs_info - > transaction_kthread ) ;
2016-06-22 18:54:24 -04:00
btrfs_cleanup_transaction ( fs_info ) ;
2014-05-07 17:06:09 -04:00
btrfs_free_fs_roots ( fs_info ) ;
2008-06-25 16:01:31 -04:00
fail_cleaner :
2008-06-25 16:01:31 -04:00
kthread_stop ( fs_info - > cleaner_kthread ) ;
2008-11-19 15:13:35 -05:00
/*
* make sure we ' re done with the btree inode before we stop our
* kthreads
*/
filemap_write_and_wait ( fs_info - > btree_inode - > i_mapping ) ;
2014-01-22 11:15:51 +08:00
fail_sysfs :
2015-08-14 18:32:47 +08:00
btrfs_sysfs_remove_mounted ( fs_info ) ;
2014-01-22 11:15:51 +08:00
2015-03-10 06:38:38 +08:00
fail_fsdev_sysfs :
btrfs_sysfs_remove_fsid ( fs_info - > fs_devices ) ;
2010-03-19 20:49:55 +00:00
fail_block_groups :
2013-04-25 13:44:38 -04:00
btrfs_put_block_group_cache ( fs_info ) ;
2011-11-03 15:17:42 -04:00
fail_tree_roots :
2020-09-03 14:29:50 -04:00
if ( fs_info - > data_reloc_root )
btrfs_drop_and_free_fs_root ( fs_info , fs_info - > data_reloc_root ) ;
2019-10-10 10:39:25 +08:00
free_root_pointers ( fs_info , true ) ;
2013-02-07 06:01:35 +00:00
invalidate_inode_pages2 ( fs_info - > btree_inode - > i_mapping ) ;
2011-11-03 15:17:42 -04:00
2007-06-12 06:35:45 -04:00
fail_sb_buffer :
2013-03-17 02:10:31 +00:00
btrfs_stop_all_workers ( fs_info ) ;
2017-02-01 22:39:50 +00:00
btrfs_free_block_groups ( fs_info ) ;
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as the inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so we made this patch, which implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that uses two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees, one is used to manage the directory name
index items which are going to be inserted into the b+ tree, and the other is
used to manage the directory name index items which are going to be deleted
from the b+ tree.
- introduce a worker to deal with the delayed operations. This worker is used
to deal with the work of the delayed directory name index item insertion
and deletion and the delayed inode update.
When the number of delayed items is beyond the lower limit, we create works
for some delayed nodes and insert them into the work queue of the worker, and
then go back.
When the number of delayed items is beyond the upper bound, we create works
for all the delayed nodes that haven't been dealt with, and insert them into
the work queue of the worker, and then wait until the number of untreated
items is below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add
the information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search
for it in the inserting rb-tree first. If we find it, we just drop it. If not,
we add its key into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit a transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
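The balance policy described in the list above (nothing below the lower limit, queue some work above it, queue everything and wait above the upper bound) can be sketched as a three-way decision. The names and the idea of passing the limits as parameters are assumptions for illustration; the message does not give the actual threshold values.

```c
#include <assert.h>

enum sketch_balance_action {
	SKETCH_BALANCE_NONE,		   /* below the lower limit */
	SKETCH_BALANCE_QUEUE_SOME,	   /* queue some nodes, return at once */
	SKETCH_BALANCE_QUEUE_ALL_AND_WAIT  /* queue all nodes, then wait */
};

static enum sketch_balance_action
sketch_balance_delayed_items(unsigned int items, unsigned int lower,
			     unsigned int upper)
{
	assert(lower <= upper);
	if (items > upper)
		return SKETCH_BALANCE_QUEUE_ALL_AND_WAIT;
	if (items > lower)
		return SKETCH_BALANCE_QUEUE_SOME;
	return SKETCH_BALANCE_NONE;
}
```

The asymmetric design keeps the common path cheap: the caller only blocks once the backlog exceeds the upper bound.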
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
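The quoted improvements can be checked against the totals above: (1.096108 - 0.932899) / 1.096108 is roughly 14.9% for creation and (1.510403 - 1.215732) / 1.510403 is roughly 19.5% for deletion, consistent with the stated ~15% and ~20%. A minimal helper for the same arithmetic (the function name is hypothetical):

```c
#include <assert.h>

/* Fractional reduction in total time: 0.15 means 15% faster. */
static double sketch_time_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before;
}
```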
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
fail_alloc :
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extent maps per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
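The per-page figures in the commit message above follow from integer division: 4096 / 144 = 28 and 4096 / 136 = 30. The sketch below reproduces that arithmetic and lays out the five fields the message lists. The struct is illustrative only: `sketch_rb_node` is a stand-in for the kernel's rb-tree node, and the real structure also carries the existing map_lookup fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's rb-tree node (hypothetical layout). */
struct sketch_rb_node {
	struct sketch_rb_node *left, *right, *parent;
};

/* Only the extra fields named in the commit message. */
struct sketch_chunk_map {
	struct sketch_rb_node rb_node;	/* for insertion into a rb tree */
	uint64_t start;			/* chunk logical address */
	uint64_t chunk_len;		/* chunk length */
	uint64_t stripe_size;		/* stripe size */
	int refs;			/* reference count */
};

/* How many whole objects of @object_size fit into one page. */
static size_t sketch_objects_per_page(size_t page_size, size_t object_size)
{
	assert(object_size > 0);
	return page_size / object_size;
}
```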
2023-11-21 13:38:38 +00:00
btrfs_mapping_tree_free ( fs_info ) ;
2011-11-09 13:26:37 +02:00
2008-06-11 21:47:56 -04:00
iput ( fs_info - > btree_inode ) ;
2009-01-21 10:49:16 -05:00
fail :
2011-11-09 13:26:37 +02:00
btrfs_close_devices ( fs_info - > fs_devices ) ;
2023-02-28 08:44:30 +08:00
ASSERT ( ret < 0 ) ;
return ret ;
2007-02-02 09:18:22 -05:00
}
2018-01-13 02:55:33 +09:00
ALLOW_ERROR_INJECTION ( open_ctree , ERRNO ) ;
2007-02-02 09:18:22 -05:00
2020-02-14 00:24:33 +09:00
static void btrfs_end_super_write ( struct bio * bio )
2008-04-10 16:19:33 -04:00
{
2020-02-14 00:24:33 +09:00
struct btrfs_device * device = bio - > bi_private ;
2024-04-20 03:49:58 +01:00
struct folio_iter fi ;
2020-02-14 00:24:33 +09:00
2024-04-20 03:49:58 +01:00
bio_for_each_folio_all ( fi , bio ) {
2020-02-14 00:24:33 +09:00
if ( bio - > bi_status ) {
btrfs_warn_rl_in_rcu ( device - > fs_info ,
2024-04-20 03:49:58 +01:00
" lost super block write due to IO error on %s (%d) " ,
2022-11-13 09:32:07 +08:00
btrfs_dev_name ( device ) ,
2020-02-14 00:24:33 +09:00
blk_status_to_errno ( bio - > bi_status ) ) ;
btrfs_dev_stat_inc_and_print ( device ,
BTRFS_DEV_STAT_WRITE_ERRS ) ;
2024-04-20 03:49:59 +01:00
/* Ensure failure if the primary sb fails. */
if ( bio - > bi_opf & REQ_FUA )
atomic_add ( BTRFS_SUPER_PRIMARY_WRITE_ERROR ,
& device - > sb_write_errors ) ;
else
atomic_inc ( & device - > sb_write_errors ) ;
2020-02-14 00:24:33 +09:00
}
2024-04-20 03:49:58 +01:00
folio_unlock ( fi . folio ) ;
folio_put ( fi . folio ) ;
2008-04-10 16:19:33 -04:00
}
2020-02-14 00:24:33 +09:00
bio_put ( bio ) ;
2008-04-10 16:19:33 -04:00
}
2020-02-14 00:24:32 +09:00
struct btrfs_super_block * btrfs_read_dev_one_super ( struct block_device * bdev ,
btrfs: check superblock to ensure the fs was not modified at thaw time
[BACKGROUND]
There is an incident report where a user hibernated the system with
a btrfs filesystem on a removable device still mounted.
Then, by accident, the btrfs got mounted and modified by another
system/OS, and was then returned to the hibernated system.
After resuming from hibernation, new writes went to the victim btrfs.
Now the fs is completely broken, since the underlying btrfs is no longer
the same one as before the hibernation, and the user lost their data due to
various transid mismatches.
[REPRODUCER]
We can emulate the situation using the following small script:
truncate -s 1G $dev
mkfs.btrfs -f $dev
mount $dev $mnt
fsstress -w -d $mnt -n 500
sync
xfs_freeze -f $mnt
cp $dev $dev.backup
# There is no way to mount the same cloned fs on the same system,
# as the conflicting fsid will be rejected by btrfs.
# Thus here we have to wipe the fs using a different btrfs.
mkfs.btrfs -f $dev.backup
dd if=$dev.backup of=$dev bs=1M
xfs_freeze -u $mnt
fsstress -w -d $mnt -n 20
umount $mnt
btrfs check $dev
The final fsck will fail because some tree blocks have an incorrect fsid.
This is enough to emulate the problem hit by the unfortunate user.
[ENHANCEMENT]
Although such a case should not be common, it can still happen from
time to time.
From the view of btrfs, we can detect any unexpected super block change,
and if there is one, we just mark the fs read-only
and thaw the fs.
This keeps the damage to a minimum, and I hope no one will lose
their data this way anymore.
Suggested-by: Goffredo Baroncelli <kreijack@libero.it>
Link: https://lore.kernel.org/linux-btrfs/83bf3b4b-7f4c-387a-b286-9251e3991e34@bluemole.com/
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-24 20:16:22 +08:00
int copy_num , bool drop_cache )
2015-08-14 18:32:58 +08:00
{
struct btrfs_super_block * super ;
2020-02-14 00:24:32 +09:00
struct page * page ;
btrfs: implement log-structured superblock for ZONED mode
Superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place superblock in the zone.
One easy solution is limiting superblock and copies to be placed only in
conventional zones. However, this method has two downsides: one is
reduced number of superblock copies. The location of the second copy of
superblock is 256GB, which is in a sequential write required zone on
typical devices in the market today. So, the number of superblock and
copies is limited to be two. Second downside is that we cannot support
devices which have no conventional zones at all.
To solve these two problems, we employ superblock log writing. It uses
two adjacent zones as a circular buffer to write updated superblocks.
Once the first zone is filled up, writing continues in the second one.
Then, when both zones are filled up and before writing to the
first zone again, the first zone is reset.
We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones
are full. For this situation, we read out the last superblock of each
zone, and compare them to determine which zone is older.
The following zones are reserved as the circular buffer on ZONED btrfs.
- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zone 1024 or the zone at 256GB, whichever is lower,
and the zone next to it
If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-10 20:26:14 +09:00
u64 bytenr , bytenr_orig ;
2024-04-11 15:53:37 +01:00
struct address_space * mapping = bdev - > bd_mapping ;
2020-11-10 20:26:14 +09:00
int ret ;
bytenr_orig = btrfs_sb_offset ( copy_num ) ;
ret = btrfs_sb_log_location_bdev ( bdev , copy_num , READ , & bytenr ) ;
if ( ret = = - ENOENT )
return ERR_PTR ( - EINVAL ) ;
else if ( ret )
return ERR_PTR ( ret ) ;
2015-08-14 18:32:58 +08:00
2021-10-18 12:11:12 +02:00
if ( bytenr + BTRFS_SUPER_INFO_SIZE > = bdev_nr_bytes ( bdev ) )
2020-02-14 00:24:32 +09:00
return ERR_PTR ( - EINVAL ) ;
2015-08-14 18:32:58 +08:00
2022-08-24 20:16:22 +08:00
if ( drop_cache ) {
/* This should only be called with the primary sb. */
ASSERT ( copy_num = = 0 ) ;
/*
* Drop the page of the primary superblock, so later reads will
* always read from the device.
*/
invalidate_inode_pages2_range ( mapping ,
bytenr > > PAGE_SHIFT ,
( bytenr + BTRFS_SUPER_INFO_SIZE ) > > PAGE_SHIFT ) ;
}
2020-02-14 00:24:32 +09:00
page = read_cache_page_gfp ( mapping , bytenr > > PAGE_SHIFT , GFP_NOFS ) ;
if ( IS_ERR ( page ) )
return ERR_CAST ( page ) ;
2015-08-14 18:32:58 +08:00
2020-02-14 00:24:32 +09:00
super = page_address ( page ) ;
2020-09-30 21:09:52 +08:00
if ( btrfs_super_magic ( super ) ! = BTRFS_MAGIC ) {
btrfs_release_disk_super ( super ) ;
return ERR_PTR ( - ENODATA ) ;
}
2020-11-10 20:26:14 +09:00
if ( btrfs_super_bytenr ( super ) ! = bytenr_orig ) {
2020-02-14 00:24:32 +09:00
btrfs_release_disk_super ( super ) ;
return ERR_PTR ( - EINVAL ) ;
2015-08-14 18:32:58 +08:00
}
2020-02-14 00:24:32 +09:00
return super ;
2015-08-14 18:32:58 +08:00
}
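The offsets produced by btrfs_sb_offset() and the bounds check in the read path above can be illustrated in userspace. This is a minimal sketch, not the kernel implementation: the constants (a 64KiB primary offset, a 4KiB superblock, a shift of 12 per mirror) are assumptions based on how the btrfs on-disk layout is commonly described, and `sb_offset()` and `sb_copy_fits()` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed constants; they are not taken from the code shown above. */
#define SB_INFO_OFFSET  (64ULL * 1024)	/* primary copy at 64KiB */
#define SB_INFO_SIZE    4096ULL		/* size of one superblock copy */
#define SB_MIRROR_SHIFT 12

/* Byte offset of superblock copy @mirror, where 0 is the primary. */
static uint64_t sb_offset(int mirror)
{
	uint64_t start = 16ULL * 1024;

	if (mirror)
		return start << (SB_MIRROR_SHIFT * mirror);
	return SB_INFO_OFFSET;
}

/*
 * Same rule as the read path: a copy is rejected when its end offset is
 * greater than or equal to the device size.
 */
static bool sb_copy_fits(int mirror, uint64_t dev_bytes)
{
	return sb_offset(mirror) + SB_INFO_SIZE < dev_bytes;
}
```

Under these assumptions the three copies land at 64KiB, 64MiB and 256GiB, so a 1GiB device only has room for the first two.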
2020-02-14 00:24:32 +09:00
struct btrfs_super_block * btrfs_read_dev_super ( struct block_device * bdev )
2008-12-08 16:46:26 -05:00
{
2020-02-14 00:24:32 +09:00
struct btrfs_super_block * super , * latest = NULL ;
2008-12-08 16:46:26 -05:00
int i ;
u64 transid = 0 ;
/* we would like to check all the supers, but that would make
* a btrfs mount succeed after a mkfs from a different FS.
* So, we need to add a special mount option to scan for
* later supers, using BTRFS_SUPER_MIRROR_MAX instead
*/
for ( i = 0 ; i < 1 ; i + + ) {
2022-08-24 20:16:22 +08:00
super = btrfs_read_dev_one_super ( bdev , i , false ) ;
2020-02-14 00:24:32 +09:00
if ( IS_ERR ( super ) )
2008-12-08 16:46:26 -05:00
continue ;
if ( ! latest | | btrfs_super_generation ( super ) > transid ) {
2020-02-14 00:24:32 +09:00
if ( latest )
btrfs_release_disk_super ( super ) ;
latest = super ;
2008-12-08 16:46:26 -05:00
transid = btrfs_super_generation ( super ) ;
}
}
2015-08-14 18:32:51 +08:00
2020-02-14 00:24:32 +09:00
return super ;
2008-12-08 16:46:26 -05:00
}
2009-06-10 15:28:55 -04:00
/*
2017-06-16 00:50:33 +02:00
* Write superblock @sb to the @device. Do not wait for completion, all the
2024-04-20 03:49:57 +01:00
* folios we use for writing are locked.
2009-06-10 15:28:55 -04:00
*
2017-06-16 00:50:33 +02:00
* Write @max_mirrors copies of the superblock, where 0 means the default
* number that fits the expected device size at commit time. Note that
* max_mirrors must be the same for the write and wait phases.
2009-06-10 15:28:55 -04:00
*
2024-04-20 03:49:57 +01:00
* Return -1 if the number of recorded errors reaches the number of mirrors
* tried (folio not found or submission failed), 0 otherwise.
2009-06-10 15:28:55 -04:00
*/
2008-12-08 16:46:26 -05:00
static int write_dev_supers ( struct btrfs_device * device ,
2017-06-16 00:50:33 +02:00
struct btrfs_super_block * sb , int max_mirrors )
2008-12-08 16:46:26 -05:00
{
2019-06-03 16:58:57 +02:00
struct btrfs_fs_info * fs_info = device - > fs_info ;
2024-04-11 15:53:37 +01:00
struct address_space * mapping = device - > bdev - > bd_mapping ;
2019-06-03 16:58:57 +02:00
SHASH_DESC_ON_STACK ( shash , fs_info - > csum_shash ) ;
2008-12-08 16:46:26 -05:00
int i ;
2020-11-10 20:26:14 +09:00
int ret ;
u64 bytenr , bytenr_orig ;
2008-12-08 16:46:26 -05:00
2024-04-20 03:49:59 +01:00
atomic_set ( & device - > sb_write_errors , 0 ) ;
2008-12-08 16:46:26 -05:00
if ( max_mirrors = = 0 )
max_mirrors = BTRFS_SUPER_MIRROR_MAX ;
2019-06-03 16:58:57 +02:00
shash - > tfm = fs_info - > csum_shash ;
2008-12-08 16:46:26 -05:00
for ( i = 0 ; i < max_mirrors ; i + + ) {
2024-04-20 03:49:57 +01:00
struct folio * folio ;
2020-02-14 00:24:33 +09:00
struct bio * bio ;
struct btrfs_super_block * disk_super ;
2024-04-20 03:49:57 +01:00
size_t offset ;
2020-02-14 00:24:33 +09:00
2020-11-10 20:26:14 +09:00
bytenr_orig = btrfs_sb_offset ( i ) ;
ret = btrfs_sb_log_location ( device , i , WRITE , & bytenr ) ;
if ( ret = = - ENOENT ) {
continue ;
} else if ( ret < 0 ) {
btrfs_err ( device - > fs_info ,
" couldn't get super block location for mirror %d " ,
i ) ;
2024-04-20 03:49:59 +01:00
atomic_inc ( & device - > sb_write_errors ) ;
2020-11-10 20:26:14 +09:00
continue ;
}
2014-09-03 21:35:33 +08:00
if ( bytenr + BTRFS_SUPER_INFO_SIZE > =
device - > commit_total_bytes )
2008-12-08 16:46:26 -05:00
break ;
2020-11-10 20:26:14 +09:00
btrfs_set_super_bytenr ( sb , bytenr_orig ) ;
2009-06-10 15:28:55 -04:00
2020-04-30 23:51:59 -07:00
crypto_shash_digest ( shash , ( const char * ) sb + BTRFS_CSUM_SIZE ,
BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE ,
sb - > csum ) ;
2009-06-10 15:28:55 -04:00
2024-04-20 03:49:57 +01:00
folio = __filemap_get_folio ( mapping , bytenr > > PAGE_SHIFT ,
FGP_LOCK | FGP_ACCESSED | FGP_CREAT ,
GFP_NOFS ) ;
if ( IS_ERR ( folio ) ) {
2017-06-16 00:50:33 +02:00
btrfs_err ( device - > fs_info ,
2020-02-14 00:24:33 +09:00
" couldn't get super block page for bytenr %llu " ,
2017-06-16 00:50:33 +02:00
bytenr ) ;
2024-04-20 03:49:59 +01:00
atomic_inc ( & device - > sb_write_errors ) ;
2009-06-10 15:28:55 -04:00
continue ;
2017-06-16 00:50:33 +02:00
}
2024-04-20 03:49:57 +01:00
ASSERT ( folio_order ( folio ) = = 0 ) ;
2013-04-29 10:05:57 -04:00
2024-04-20 03:49:57 +01:00
offset = offset_in_folio ( folio , bytenr ) ;
disk_super = folio_address ( folio ) + offset ;
2020-02-14 00:24:33 +09:00
memcpy ( disk_super , sb , BTRFS_SUPER_INFO_SIZE ) ;
2009-06-10 15:28:55 -04:00
2020-02-14 00:24:33 +09:00
/*
* Directly use bios here instead of relying on the page cache
* to do I/O, so we don't lose the ability to do integrity
* checking.
*/
2022-01-24 10:11:05 +01:00
bio = bio_alloc ( device - > bdev , 1 ,
REQ_OP_WRITE | REQ_SYNC | REQ_META | REQ_PRIO ,
GFP_NOFS ) ;
2020-02-14 00:24:33 +09:00
bio - > bi_iter . bi_sector = bytenr > > SECTOR_SHIFT ;
bio - > bi_private = device ;
bio - > bi_end_io = btrfs_end_super_write ;
2024-04-20 03:49:57 +01:00
bio_add_folio_nofail ( bio , folio , BTRFS_SUPER_INFO_SIZE , offset ) ;
2008-12-08 16:46:26 -05:00
2011-11-18 15:07:51 -05:00
/*
2020-02-14 00:24:33 +09:00
* We FUA only the first super block. The others we allow to
* go down lazy and there's a short window where the on-disk
* copies might still contain the older version.
2011-11-18 15:07:51 -05:00
*/
2017-12-05 22:54:02 -08:00
if ( i = = 0 & & ! btrfs_test_opt ( device - > fs_info , NOBARRIER ) )
2020-02-14 00:24:33 +09:00
bio - > bi_opf | = REQ_FUA ;
2022-04-04 06:45:18 +02:00
submit_bio ( bio ) ;
2021-08-19 21:19:14 +09:00
if ( btrfs_advance_sb_log ( device , i ) )
2024-04-20 03:49:59 +01:00
atomic_inc ( & device - > sb_write_errors ) ;
2008-12-08 16:46:26 -05:00
}
2024-04-20 03:49:59 +01:00
return atomic_read ( & device - > sb_write_errors ) < i ? 0 : - 1 ;
2008-12-08 16:46:26 -05:00
}
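The two-zone circular log described in the ZONED commit message, which btrfs_sb_log_location() and btrfs_advance_sb_log() rely on in the write loop above, can be modelled in a few lines. This is a simplified sketch under stated assumptions and not the kernel's algorithm: each zone holds a fixed number of superblock slots, the write pointer is a slot index, and the older of two full zones is identified by the generation of its last superblock. All names (`sb_log`, `sb_log_pick_zone`, `sb_log_write`, `SLOTS_PER_ZONE`) are hypothetical.

```c
#include <stdint.h>

#define SLOTS_PER_ZONE 2	/* hypothetical; real zones hold many more */

struct sb_zone {
	unsigned int wp;	/* write pointer: index of the next free slot */
	uint64_t last_gen;	/* generation of the newest superblock in the zone */
};

struct sb_log {
	struct sb_zone zone[2];
};

static int zone_full(const struct sb_zone *z)
{
	return z->wp == SLOTS_PER_ZONE;
}

static int zone_empty(const struct sb_zone *z)
{
	return z->wp == 0;
}

/* Pick the zone for the next write, resetting the older one if both are full. */
static int sb_log_pick_zone(struct sb_log *log)
{
	struct sb_zone *z = log->zone;
	int idx;

	if (zone_full(&z[0]) && zone_full(&z[1])) {
		idx = z[0].last_gen < z[1].last_gen ? 0 : 1;
		z[idx].wp = 0;	/* models a zone reset */
		return idx;
	}
	/* Keep filling a partially written zone if there is one. */
	if (!zone_empty(&z[0]) && !zone_full(&z[0]))
		return 0;
	if (!zone_empty(&z[1]) && !zone_full(&z[1]))
		return 1;
	/* Either both zones are empty, or one is full and the other empty. */
	return zone_full(&z[0]) ? 1 : 0;
}

/* Append a superblock of generation @gen; returns zone * SLOTS_PER_ZONE + slot. */
static unsigned int sb_log_write(struct sb_log *log, uint64_t gen)
{
	int idx = sb_log_pick_zone(log);
	struct sb_zone *z = &log->zone[idx];
	unsigned int slot = z->wp++;

	z->last_gen = gen;
	return (unsigned int)idx * SLOTS_PER_ZONE + slot;
}
```

In this model the write position cycles through both zones in order, and a zone is only reset at the moment both are full, so the most recent superblock always survives in the zone that was not reset.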
2017-06-16 00:50:33 +02:00
/*
* Wait for write completion of superblocks done by write_dev_supers,
* @max_mirrors same for write and wait phases.
*
2024-04-20 03:49:59 +01:00
* Return -1 if primary super block write failed or when there were no super block
* copies written. Otherwise 0.
2017-06-16 00:50:33 +02:00
*/
static int wait_dev_supers ( struct btrfs_device * device , int max_mirrors )
{
int i ;
int errors = 0 ;
2018-02-02 11:09:01 -08:00
bool primary_failed = false ;
2020-11-10 20:26:14 +09:00
int ret ;
2017-06-16 00:50:33 +02:00
u64 bytenr ;
if ( max_mirrors = = 0 )
max_mirrors = BTRFS_SUPER_MIRROR_MAX ;
for ( i = 0 ; i < max_mirrors ; i + + ) {
2024-04-20 03:49:56 +01:00
struct folio * folio ;
2020-02-14 00:24:33 +09:00
2020-11-10 20:26:14 +09:00
ret = btrfs_sb_log_location ( device , i , READ , & bytenr ) ;
if ( ret = = - ENOENT ) {
break ;
} else if ( ret < 0 ) {
errors + + ;
if ( i = = 0 )
primary_failed = true ;
continue ;
}
2017-06-16 00:50:33 +02:00
if ( bytenr + BTRFS_SUPER_INFO_SIZE > =
device - > commit_total_bytes )
break ;
2024-05-21 09:51:42 -07:00
folio = filemap_get_folio ( device - > bdev - > bd_mapping ,
2024-04-20 03:49:56 +01:00
bytenr > > PAGE_SHIFT ) ;
2024-04-20 03:49:59 +01:00
/* If the folio has been removed, then we know it completed. */
if ( IS_ERR ( folio ) )
2017-06-16 00:50:33 +02:00
continue ;
2024-04-20 03:49:56 +01:00
ASSERT ( folio_order ( folio ) = = 0 ) ;
/* Folio will be unlocked once the write completes. */
folio_wait_locked ( folio ) ;
folio_put ( folio ) ;
2017-06-16 00:50:33 +02:00
}
2024-04-20 03:49:59 +01:00
errors + = atomic_read ( & device - > sb_write_errors ) ;
if ( errors > = BTRFS_SUPER_PRIMARY_WRITE_ERROR )
primary_failed = true ;
2018-02-02 11:09:01 -08:00
if ( primary_failed ) {
btrfs_err ( device - > fs_info , " error writing primary super block to device %llu " ,
device - > devid ) ;
return - 1 ;
}
2017-06-16 00:50:33 +02:00
return errors < i ? 0 : - 1 ;
}
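The error accounting shared by btrfs_end_super_write() and wait_dev_supers() packs two facts into one atomic counter: how many copies failed, and whether the primary was among them. The sketch below shows that idea in userspace with C11 atomics. It is an illustration only: the sentinel value `PRIMARY_WRITE_ERROR`, the `sb_dev` struct and both function names are assumptions, and the primary is identified by an explicit flag here, whereas the completion handler above checks REQ_FUA.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed sentinel: far larger than any plausible count of failed copies. */
#define PRIMARY_WRITE_ERROR (INT_MAX / 2)

struct sb_dev {
	atomic_int sb_write_errors;
};

/* Completion side: record one failed superblock write. */
static void sb_write_failed(struct sb_dev *dev, bool is_primary)
{
	if (is_primary)
		atomic_fetch_add(&dev->sb_write_errors, PRIMARY_WRITE_ERROR);
	else
		atomic_fetch_add(&dev->sb_write_errors, 1);
}

/*
 * Wait side: -1 if the primary failed or if every attempted copy failed,
 * 0 otherwise.
 */
static int sb_write_result(struct sb_dev *dev, int copies_attempted)
{
	int errors = atomic_load(&dev->sb_write_errors);

	if (errors >= PRIMARY_WRITE_ERROR)
		return -1;
	return errors < copies_attempted ? 0 : -1;
}
```

Because the sentinel dwarfs the number of mirrors, a single comparison against it is enough to tell a primary failure apart from any number of secondary failures.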
2011-11-18 15:07:51 -05:00
/*
* endio for write_dev_flush, this will wake anyone waiting
* for the barrier when it is done
*/
2015-07-20 15:29:37 +02:00
static void btrfs_end_empty_barrier ( struct bio * bio )
2011-11-18 15:07:51 -05:00
{
2022-04-06 08:12:24 +02:00
bio_uninit ( bio ) ;
2017-06-06 17:06:06 +02:00
complete ( bio - > bi_private ) ;
2011-11-18 15:07:51 -05:00
}
/*
2017-06-13 17:05:41 +08:00
* Submit a flush request to the device if it supports it. Error handling is
* done in the waiting counterpart.
2011-11-18 15:07:51 -05:00
*/
2017-06-13 17:05:41 +08:00
static void write_dev_flush ( struct btrfs_device * device )
2011-11-18 15:07:51 -05:00
{
2022-04-06 08:12:24 +02:00
struct bio * bio = & device - > flush_bio ;
2011-11-18 15:07:51 -05:00
2023-03-27 17:53:07 +08:00
device - > last_flush_error = BLK_STS_OK ;
2022-04-06 08:12:24 +02:00
bio_init ( bio , device - > bdev , NULL , 0 ,
REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH ) ;
2011-11-18 15:07:51 -05:00
bio - > bi_end_io = btrfs_end_empty_barrier ;
init_completion ( & device - > flush_wait ) ;
bio - > bi_private = & device - > flush_wait ;
2022-04-04 06:45:18 +02:00
submit_bio ( bio ) ;
2017-12-04 12:54:56 +08:00
set_bit ( BTRFS_DEV_STATE_FLUSH_SENT , & device - > dev_state ) ;
2017-06-13 17:05:41 +08:00
}
2011-11-18 15:07:51 -05:00
2017-06-13 17:05:41 +08:00
/*
* If the flush bio has been submitted by write_dev_flush , wait for it .
2023-03-27 17:53:09 +08:00
* Return true for any error , and false otherwise .
2017-06-13 17:05:41 +08:00
*/
2023-03-27 17:53:09 +08:00
static bool wait_dev_flush ( struct btrfs_device * device )
2017-06-13 17:05:41 +08:00
{
2022-04-06 08:12:24 +02:00
struct bio * bio = & device - > flush_bio ;
2011-11-18 15:07:51 -05:00
2023-03-27 17:53:10 +08:00
if ( ! test_and_clear_bit ( BTRFS_DEV_STATE_FLUSH_SENT , & device - > dev_state ) )
2023-03-27 17:53:09 +08:00
return false ;
2011-11-18 15:07:51 -05:00
2017-06-15 16:04:26 +02:00
wait_for_completion_io ( & device - > flush_wait ) ;
2011-11-18 15:07:51 -05:00
2023-03-27 17:53:07 +08:00
if ( bio - > bi_status ) {
device - > last_flush_error = bio - > bi_status ;
btrfs_dev_stat_inc_and_print ( device , BTRFS_DEV_STAT_FLUSH_ERRS ) ;
2023-03-27 17:53:09 +08:00
return true ;
2023-03-27 17:53:07 +08:00
}
2023-03-27 17:53:09 +08:00
return false ;
2011-11-18 15:07:51 -05:00
}
/*
 * send an empty flush down to each device in parallel,
 * then wait for them
 */
static int barrier_all_devices(struct btrfs_fs_info *info)
{
	struct list_head *head;
	struct btrfs_device *dev;
2012-08-01 18:56:49 +02:00
	int errors_wait = 0;
2011-11-18 15:07:51 -05:00
2017-06-16 00:28:47 +02:00
	lockdep_assert_held(&info->fs_devices->device_list_mutex);
2011-11-18 15:07:51 -05:00
	/* send down all the barriers */
	head = &info->fs_devices->devices;
2017-06-16 00:28:47 +02:00
	list_for_each_entry(dev, head, dev_list) {
2017-12-04 12:54:54 +08:00
		if (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state))
2014-02-05 16:34:38 +09:00
			continue;
2017-06-13 17:05:40 +08:00
		if (!dev->bdev)
2011-11-18 15:07:51 -05:00
			continue;
2017-12-04 12:54:53 +08:00
		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &dev->dev_state) ||
2017-12-04 12:54:52 +08:00
		    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
2011-11-18 15:07:51 -05:00
			continue;
2017-06-13 17:05:41 +08:00
		write_dev_flush(dev);
2011-11-18 15:07:51 -05:00
	}
	/* wait for all the barriers */
2017-06-16 00:28:47 +02:00
	list_for_each_entry(dev, head, dev_list) {
2017-12-04 12:54:54 +08:00
		if (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state))
2014-02-05 16:34:38 +09:00
			continue;
2011-11-18 15:07:51 -05:00
		if (!dev->bdev) {
2012-08-01 18:56:49 +02:00
			errors_wait++;
2011-11-18 15:07:51 -05:00
			continue;
		}
2017-12-04 12:54:53 +08:00
		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &dev->dev_state) ||
2017-12-04 12:54:52 +08:00
		    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
2011-11-18 15:07:51 -05:00
			continue;
2023-03-27 17:53:09 +08:00
		if (wait_dev_flush(dev))
2012-08-01 18:56:49 +02:00
			errors_wait++;
2017-05-06 07:17:54 +08:00
	}
2023-03-27 17:53:08 +08:00
	/*
	 * Checks last_flush_error of disks in order to determine the device
	 * state.
	 */
	if (errors_wait && !btrfs_check_rw_degradable(info, NULL))
		return -EIO;
2011-11-18 15:07:51 -05:00
	return 0;
}
2015-08-19 15:54:15 +08:00
int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags)
{
2015-09-15 21:08:07 +08:00
	int raid_type;
	int min_tolerated = INT_MAX;
2015-08-19 15:54:15 +08:00
2015-09-15 21:08:07 +08:00
	if ((flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 ||
	    (flags & BTRFS_AVAIL_ALLOC_BIT_SINGLE))
2019-05-17 11:43:36 +02:00
		min_tolerated = min_t(int, min_tolerated,
2015-09-15 21:08:07 +08:00
				    btrfs_raid_array[BTRFS_RAID_SINGLE].
				    tolerated_failures);
2015-08-19 15:54:15 +08:00
2015-09-15 21:08:07 +08:00
	for (raid_type = 0; raid_type < BTRFS_NR_RAID_TYPES; raid_type++) {
		if (raid_type == BTRFS_RAID_SINGLE)
			continue;
2018-04-25 19:01:43 +08:00
		if (!(flags & btrfs_raid_array[raid_type].bg_flag))
2015-09-15 21:08:07 +08:00
			continue;
2019-05-17 11:43:36 +02:00
		min_tolerated = min_t(int, min_tolerated,
2015-09-15 21:08:07 +08:00
				    btrfs_raid_array[raid_type].
				    tolerated_failures);
	}
2015-08-19 15:54:15 +08:00
2015-09-15 21:08:07 +08:00
	if (min_tolerated == INT_MAX) {
2016-09-20 10:05:02 -04:00
		pr_warn("BTRFS: unknown raid flag: %llu", flags);
2015-09-15 21:08:07 +08:00
		min_tolerated = 0;
	}
	return min_tolerated;
2015-08-19 15:54:15 +08:00
}
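The function above takes the minimum of `tolerated_failures` over every RAID profile present in `flags`. The following is a minimal userspace sketch of that idea, not the kernel code: the `DEMO_*` bits, `demo_raid_array` and `demo_min_tolerated_failures()` are hypothetical names invented for illustration, and the table holds only three profiles (RAID0 tolerating 0 failures, RAID1 tolerating 1, RAID6 tolerating 2). It also simplifies the "single" case to "no profile bit set".

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical profile bits, for illustration only (not the on-disk values). */
enum {
	DEMO_PROFILE_RAID0 = 1 << 0,
	DEMO_PROFILE_RAID1 = 1 << 1,
	DEMO_PROFILE_RAID6 = 1 << 2,
};

#define DEMO_PROFILE_MASK \
	(DEMO_PROFILE_RAID0 | DEMO_PROFILE_RAID1 | DEMO_PROFILE_RAID6)

/* Hypothetical, reduced stand-in for a per-profile attribute table. */
struct demo_raid_attr {
	uint64_t bg_flag;
	int tolerated_failures;
};

static const struct demo_raid_attr demo_raid_array[] = {
	{ DEMO_PROFILE_RAID0, 0 },
	{ DEMO_PROFILE_RAID1, 1 },
	{ DEMO_PROFILE_RAID6, 2 },
};

/*
 * Return the smallest number of device failures tolerated by any profile
 * set in @flags. The weakest profile decides for the whole filesystem.
 */
static int demo_min_tolerated_failures(uint64_t flags)
{
	int min_tolerated = INT_MAX;
	size_t i;

	/* Simplification: no profile bit means "single", tolerating nothing. */
	if ((flags & DEMO_PROFILE_MASK) == 0)
		return 0;

	for (i = 0; i < sizeof(demo_raid_array) / sizeof(demo_raid_array[0]); i++) {
		if (!(flags & demo_raid_array[i].bg_flag))
			continue;
		if (demo_raid_array[i].tolerated_failures < min_tolerated)
			min_tolerated = demo_raid_array[i].tolerated_failures;
	}
	return min_tolerated;
}
```

With this table, a filesystem mixing the RAID1 and RAID6 bits is limited by RAID1, and any mix that includes RAID0 tolerates no failures at all.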
2017-02-10 19:04:32 +01:00
int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
2008-04-10 16:19:33 -04:00
{
2009-06-10 15:17:02 -04:00
	struct list_head *head;
2008-04-10 16:19:33 -04:00
	struct btrfs_device *dev;
2008-05-07 11:43:44 -04:00
	struct btrfs_super_block *sb;
2008-04-10 16:19:33 -04:00
	struct btrfs_dev_item *dev_item;
	int ret;
	int do_barriers;
2008-04-29 09:38:00 -04:00
	int max_errors;
	int total_errors = 0;
2008-05-07 11:43:44 -04:00
	u64 flags;
2008-04-10 16:19:33 -04:00
2016-06-22 18:54:23 -04:00
	do_barriers = !btrfs_test_opt(fs_info, NOBARRIER);
2017-09-13 12:25:21 -06:00
	/*
	 * max_mirrors == 0 indicates we were called from commit_transaction,
	 * not from fsync, where the tree roots in fs_info are not yet
	 * consistent on disk.
	 */
	if (max_mirrors == 0)
		backup_super_roots(fs_info);
2008-04-10 16:19:33 -04:00
2016-06-22 18:54:23 -04:00
	sb = fs_info->super_for_commit;
2008-05-07 11:43:44 -04:00
	dev_item = &sb->dev_item;
2009-06-10 15:17:02 -04:00
2016-06-22 18:54:23 -04:00
	mutex_lock(&fs_info->fs_devices->device_list_mutex);
	head = &fs_info->fs_devices->devices;
	max_errors = btrfs_super_num_devices(fs_info->super_copy) - 1;
2011-11-18 15:07:51 -05:00
2012-08-01 18:56:49 +02:00
	if (do_barriers) {
2016-06-22 18:54:23 -04:00
		ret = barrier_all_devices(fs_info);
2012-08-01 18:56:49 +02:00
		if (ret) {
			mutex_unlock(
2016-06-22 18:54:23 -04:00
				&fs_info->fs_devices->device_list_mutex);
			btrfs_handle_fs_error(fs_info, ret,
					      "errors while submitting device barriers.");
2012-08-01 18:56:49 +02:00
			return ret;
		}
	}
2011-11-18 15:07:51 -05:00
2017-06-16 00:28:47 +02:00
	list_for_each_entry(dev, head, dev_list) {
2008-05-13 13:46:40 -04:00
		if (!dev->bdev) {
			total_errors++;
			continue;
		}
2017-12-04 12:54:53 +08:00
		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &dev->dev_state) ||
2017-12-04 12:54:52 +08:00
		    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
2008-05-13 13:46:40 -04:00
			continue;
2008-11-17 21:11:30 -05:00
		btrfs_set_stack_device_generation(dev_item, 0);
2008-05-07 11:43:44 -04:00
		btrfs_set_stack_device_type(dev_item, dev->type);
		btrfs_set_stack_device_id(dev_item, dev->devid);
2014-07-24 11:37:13 +08:00
		btrfs_set_stack_device_total_bytes(dev_item,
2014-09-03 21:35:33 +08:00
						   dev->commit_total_bytes);
2014-09-03 21:35:34 +08:00
		btrfs_set_stack_device_bytes_used(dev_item,
						  dev->commit_bytes_used);
2008-05-07 11:43:44 -04:00
		btrfs_set_stack_device_io_align(dev_item, dev->io_align);
		btrfs_set_stack_device_io_width(dev_item, dev->io_width);
		btrfs_set_stack_device_sector_size(dev_item, dev->sector_size);
		memcpy(dev_item->uuid, dev->uuid, BTRFS_UUID_SIZE);
2018-10-30 16:43:23 +02:00
		memcpy(dev_item->fsid, dev->fs_devices->metadata_uuid,
		       BTRFS_FSID_SIZE);
2008-12-08 16:46:26 -05:00
2008-05-07 11:43:44 -04:00
		flags = btrfs_super_flags(sb);
		btrfs_set_super_flags(sb, flags | BTRFS_HEADER_FLAG_WRITTEN);
btrfs: Do super block verification before writing it to disk
There are already 2 reports about strangely corrupted super blocks,
where csum still matches but extra garbage gets slipped into super block.
The corruption would look like:
------
superblock: bytenr=65536, device=/dev/sdc1
---------------------------------------------------------
csum_type 41700 (INVALID)
csum 0x3b252d3a [match]
bytenr 65536
flags 0x1
( WRITTEN )
magic _BHRfS_M [match]
...
incompat_flags 0x5b22400000000169
( MIXED_BACKREF |
COMPRESS_LZO |
BIG_METADATA |
EXTENDED_IREF |
SKINNY_METADATA |
unknown flag: 0x5b22400000000000 )
...
------
Or
------
superblock: bytenr=65536, device=/dev/mapper/x
---------------------------------------------------------
csum_type 35355 (INVALID)
csum_size 32
csum 0xf0dbeddd [match]
bytenr 65536
flags 0x1
( WRITTEN )
magic _BHRfS_M [match]
...
incompat_flags 0x176d200000000169
( MIXED_BACKREF |
COMPRESS_LZO |
BIG_METADATA |
EXTENDED_IREF |
SKINNY_METADATA |
unknown flag: 0x176d200000000000 )
------
Obviously, csum_type and incompat_flags get some garbage, but its csum
still matches, which means kernel calculates the csum based on corrupted
super block memory.
And after manually fixing these values, the filesystem is completely
healthy without any problem exposed by btrfs check.
Although the cause is still unknown, at least detect it and prevent further
corruption.
Both reports have same symptoms, there's an overwrite on offset 192 of
the superblock, by 4 bytes. The superblock structure is not allocated or
freed and stays in the memory for the whole filesystem lifetime, so it's
not a use-after-free kind of error on someone else's leaked page.
A vague hint at the probable cause is the mention of other system
freezes related to graphics card drivers.
Reported-by: Ken Swenson <flat@imo.uto.moe>
Reported-by: Ben Parsons <9parsonsb@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add brief analysis of the reports ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-11 13:35:27 +08:00
		ret = btrfs_validate_write_super(fs_info, sb);
		if (ret < 0) {
			mutex_unlock(&fs_info->fs_devices->device_list_mutex);
			btrfs_handle_fs_error(fs_info, -EUCLEAN,
					      "unexpected superblock corruption detected");
			return -EUCLEAN;
		}
2017-06-16 00:50:33 +02:00
		ret = write_dev_supers(dev, sb, max_mirrors);
2008-04-29 09:38:00 -04:00
		if (ret)
			total_errors++;
2008-04-10 16:19:33 -04:00
	}
2008-04-29 09:38:00 -04:00
	if (total_errors > max_errors) {
2016-06-22 18:54:23 -04:00
		btrfs_err(fs_info, "%d errors while writing supers",
			  total_errors);
		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
2012-03-12 16:03:00 +01:00
2013-08-09 17:08:40 +02:00
		/* FUA is masked off if unsupported and can't be the reason */
2016-06-22 18:54:23 -04:00
		btrfs_handle_fs_error(fs_info, -EIO,
				      "%d errors while writing supers",
				      total_errors);
2013-08-09 17:08:40 +02:00
		return -EIO;
2008-04-29 09:38:00 -04:00
	}
2008-04-10 16:19:33 -04:00
2008-12-08 16:46:26 -05:00
	total_errors = 0;
2017-06-16 00:28:47 +02:00
	list_for_each_entry(dev, head, dev_list) {
2008-05-13 13:46:40 -04:00
		if (!dev->bdev)
			continue;
2017-12-04 12:54:53 +08:00
		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &dev->dev_state) ||
2017-12-04 12:54:52 +08:00
		    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
2008-05-13 13:46:40 -04:00
			continue;
2017-06-16 00:50:33 +02:00
		ret = wait_dev_supers(dev, max_mirrors);
2008-12-08 16:46:26 -05:00
		if (ret)
			total_errors++;
2008-04-10 16:19:33 -04:00
	}
2016-06-22 18:54:23 -04:00
	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
2008-04-29 09:38:00 -04:00
	if (total_errors > max_errors) {
2016-06-22 18:54:23 -04:00
		btrfs_handle_fs_error(fs_info, -EIO,
				      "%d errors while writing supers",
				      total_errors);
2012-03-12 16:03:00 +01:00
		return -EIO;
2008-04-29 09:38:00 -04:00
	}
2008-04-10 16:19:33 -04:00
	return 0;
}
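The error accounting in write_all_supers() can be read as: a device with no block device counts as an error, a device that is not writeable is skipped, and the commit fails only when the error count exceeds `num_devices - 1`, that is, when no device took the super block. The following is a rough userspace sketch of that accounting under simplifying assumptions: `struct demo_dev` and `demo_write_all()` are hypothetical, the device count stands in for btrfs_super_num_devices(), and a boolean replaces the real write.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-device state, for illustration only. */
struct demo_dev {
	bool has_bdev;   /* false models a missing block device */
	bool writeable;  /* false models a device that is skipped */
	bool write_ok;   /* outcome of the simulated super block write */
};

/*
 * Count write errors across all devices and fail with -EIO only when
 * more than (ndevs - 1) of them failed.
 */
static int demo_write_all(const struct demo_dev *devs, int ndevs)
{
	int max_errors = ndevs - 1;
	int total_errors = 0;
	int i;

	for (i = 0; i < ndevs; i++) {
		if (!devs[i].has_bdev) {
			/* A missing device counts as an error. */
			total_errors++;
			continue;
		}
		if (!devs[i].writeable)
			continue;	/* skipped, neither success nor error */
		if (!devs[i].write_ok)
			total_errors++;
	}
	return total_errors > max_errors ? -EIO : 0;
}
```

In this sketch two devices with one missing still succeed (1 error, 1 allowed), while a single device whose write fails returns `-EIO`.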
2013-05-15 07:48:19 +00:00
/* Drop a fs root from the radix tree and free it. */
void btrfs_drop_and_free_fs_root(struct btrfs_fs_info *fs_info,
				 struct btrfs_root *root)
2007-04-10 16:58:11 -04:00
{
2020-02-14 16:11:45 -05:00
	bool drop_ref = false;
2022-07-15 13:59:21 +02:00
	spin_lock(&fs_info->fs_roots_radix_lock);
	radix_tree_delete(&fs_info->fs_roots_radix,
2024-04-15 16:16:23 -04:00
			  (unsigned long)btrfs_root_id(root));
2022-07-15 13:59:21 +02:00
	if (test_and_clear_bit(BTRFS_ROOT_IN_RADIX, &root->state))
2020-02-14 16:11:45 -05:00
		drop_ref = true;
2022-07-15 13:59:21 +02:00
	spin_unlock(&fs_info->fs_roots_radix_lock);
2009-09-21 16:00:26 -04:00
2021-10-05 16:35:25 -04:00
	if (BTRFS_FS_ERROR(fs_info)) {
2020-03-24 10:47:52 -04:00
		ASSERT(root->log_root == NULL);
2016-07-19 15:36:05 -07:00
		if (root->reloc_root) {
2020-01-24 09:33:01 -05:00
			btrfs_put_root(root->reloc_root);
2016-07-19 15:36:05 -07:00
			root->reloc_root = NULL;
		}
	}
2013-02-27 13:28:24 +00:00
2020-02-14 16:11:45 -05:00
	if (drop_ref)
		btrfs_put_root(root);
2007-04-10 16:58:11 -04:00
}
2016-06-21 21:16:51 -04:00
int btrfs_commit_super(struct btrfs_fs_info *fs_info)
2008-11-12 14:34:12 -05:00
{
2016-06-22 18:54:23 -04:00
	mutex_lock(&fs_info->cleaner_mutex);
2016-06-22 18:54:24 -04:00
	btrfs_run_delayed_iputs(fs_info);
2016-06-22 18:54:23 -04:00
	mutex_unlock(&fs_info->cleaner_mutex);
	wake_up_process(fs_info->cleaner_kthread);
2009-11-12 09:34:40 +00:00
	/* wait until ongoing cleanup work done */
2016-06-22 18:54:23 -04:00
	down_write(&fs_info->cleanup_work_sem);
	up_write(&fs_info->cleanup_work_sem);
2009-11-12 09:34:40 +00:00
2024-05-22 09:26:44 +01:00
	return btrfs_commit_current_transaction(fs_info->tree_root);
2008-11-12 14:34:12 -05:00
}
2021-12-16 19:47:36 +08:00
static void warn_about_uncommitted_trans(struct btrfs_fs_info *fs_info)
{
	struct btrfs_transaction *trans;
	struct btrfs_transaction *tmp;
	bool found = false;
	/*
	 * This function is only called at the very end of close_ctree(),
	 * thus no other running transaction, no need to take trans_lock.
	 */
	ASSERT(test_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags));
	list_for_each_entry_safe(trans, tmp, &fs_info->trans_list, list) {
		struct extent_state *cached = NULL;
		u64 dirty_bytes = 0;
		u64 cur = 0;
		u64 found_start;
		u64 found_end;
		found = true;
2023-06-30 16:03:49 +01:00
		while (find_first_extent_bit(&trans->dirty_pages, cur,
2021-12-16 19:47:36 +08:00
			&found_start, &found_end, EXTENT_DIRTY, &cached)) {
			dirty_bytes += found_end + 1 - found_start;
			cur = found_end + 1;
		}
		btrfs_warn(fs_info,
	"transaction %llu (with %llu dirty metadata bytes) is not committed",
			   trans->transid, dirty_bytes);
		btrfs_cleanup_one_transaction(trans, fs_info);
		if (trans == fs_info->running_transaction)
			fs_info->running_transaction = NULL;
		list_del_init(&trans->list);
		btrfs_put_transaction(trans);
		trace_btrfs_transaction_commit(fs_info);
	}
	ASSERT(!found);
}
2019-10-01 19:57:35 +02:00
void __cold close_ctree(struct btrfs_fs_info *fs_info)
2008-11-12 14:34:12 -05:00
{
	int ret;
2016-09-02 15:40:02 -04:00
	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
btrfs: fix hang during unmount when block group reclaim task is running
When we start an unmount, at close_ctree(), if we have the reclaim task
running and in the middle of a data block group relocation, we can trigger
a deadlock when stopping an async reclaim task, producing a trace like the
following:
[629724.498185] task:kworker/u16:7 state:D stack: 0 pid:681170 ppid: 2 flags:0x00004000
[629724.499760] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[629724.501267] Call Trace:
[629724.501759] <TASK>
[629724.502174] __schedule+0x3cb/0xed0
[629724.502842] schedule+0x4e/0xb0
[629724.503447] btrfs_wait_on_delayed_iputs+0x7c/0xc0 [btrfs]
[629724.504534] ? prepare_to_wait_exclusive+0xc0/0xc0
[629724.505442] flush_space+0x423/0x630 [btrfs]
[629724.506296] ? rcu_read_unlock_trace_special+0x20/0x50
[629724.507259] ? lock_release+0x220/0x4a0
[629724.507932] ? btrfs_get_alloc_profile+0xb3/0x290 [btrfs]
[629724.508940] ? do_raw_spin_unlock+0x4b/0xa0
[629724.509688] btrfs_async_reclaim_metadata_space+0x139/0x320 [btrfs]
[629724.510922] process_one_work+0x252/0x5a0
[629724.511694] ? process_one_work+0x5a0/0x5a0
[629724.512508] worker_thread+0x52/0x3b0
[629724.513220] ? process_one_work+0x5a0/0x5a0
[629724.514021] kthread+0xf2/0x120
[629724.514627] ? kthread_complete_and_exit+0x20/0x20
[629724.515526] ret_from_fork+0x22/0x30
[629724.516236] </TASK>
[629724.516694] task:umount state:D stack: 0 pid:719055 ppid:695412 flags:0x00004000
[629724.518269] Call Trace:
[629724.518746] <TASK>
[629724.519160] __schedule+0x3cb/0xed0
[629724.519835] schedule+0x4e/0xb0
[629724.520467] schedule_timeout+0xed/0x130
[629724.521221] ? lock_release+0x220/0x4a0
[629724.521946] ? lock_acquired+0x19c/0x420
[629724.522662] ? trace_hardirqs_on+0x1b/0xe0
[629724.523411] __wait_for_common+0xaf/0x1f0
[629724.524189] ? usleep_range_state+0xb0/0xb0
[629724.524997] __flush_work+0x26d/0x530
[629724.525698] ? flush_workqueue_prep_pwqs+0x140/0x140
[629724.526580] ? lock_acquire+0x1a0/0x310
[629724.527324] __cancel_work_timer+0x137/0x1c0
[629724.528190] close_ctree+0xfd/0x531 [btrfs]
[629724.529000] ? evict_inodes+0x166/0x1c0
[629724.529510] generic_shutdown_super+0x74/0x120
[629724.530103] kill_anon_super+0x14/0x30
[629724.530611] btrfs_kill_super+0x12/0x20 [btrfs]
[629724.531246] deactivate_locked_super+0x31/0xa0
[629724.531817] cleanup_mnt+0x147/0x1c0
[629724.532319] task_work_run+0x5c/0xa0
[629724.532984] exit_to_user_mode_prepare+0x1a6/0x1b0
[629724.533598] syscall_exit_to_user_mode+0x16/0x40
[629724.534200] do_syscall_64+0x48/0x90
[629724.534667] entry_SYSCALL_64_after_hwframe+0x44/0xae
[629724.535318] RIP: 0033:0x7fa2b90437a7
[629724.535804] RSP: 002b:00007ffe0b7e4458 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[629724.536912] RAX: 0000000000000000 RBX: 00007fa2b9182264 RCX: 00007fa2b90437a7
[629724.538156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000555d6cf20dd0
[629724.539053] RBP: 0000555d6cf20ba0 R08: 0000000000000000 R09: 00007ffe0b7e3200
[629724.539956] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[629724.540883] R13: 0000555d6cf20dd0 R14: 0000555d6cf20cb0 R15: 0000000000000000
[629724.541796] </TASK>
This happens because:
1) Before entering close_ctree() we have the async block group reclaim
task running and relocating a data block group;
2) There's an async metadata (or data) space reclaim task running;
3) We enter close_ctree() and park the cleaner kthread;
4) The async space reclaim task is at flush_space() and runs all the
existing delayed iputs;
5) Before the async space reclaim task calls
btrfs_wait_on_delayed_iputs(), the block group reclaim task which is
doing the data block group relocation, creates a delayed iput at
replace_file_extents() (called when COWing leaves that have file extent
items pointing to relocated data extents, during the merging phase
of relocation roots);
6) The async space reclaim task blocks at
btrfs_wait_on_delayed_iputs(), since we have a new delayed iput;
7) The task at close_ctree() then calls cancel_work_sync() to stop the
async space reclaim task, but it blocks since that task is waiting for
the delayed iput to be run;
8) The delayed iput is never run because the cleaner kthread is parked,
and no one else runs delayed iputs, resulting in a hang.
So fix this by stopping the async block group reclaim task before we
park the cleaner kthread.
Fixes: 18bb8bbf13c183 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-18 10:41:48 +01:00
2022-09-08 12:31:50 +01:00
	/*
	 * If we had UNFINISHED_DROPS we could still be processing them, so
	 * clear that bit and wake up relocation so it can stop.
	 * We must do this before stopping the block group reclaim task, because
	 * at btrfs_relocate_block_group() we wait for this bit, and after the
	 * wait we stop with -EINTR if btrfs_fs_closing() returns non-zero - we
	 * have just set BTRFS_FS_CLOSING_START, so btrfs_fs_closing() will
	 * return 1.
	 */
	btrfs_wake_unfinished_drop(fs_info);
2022-05-18 10:41:48 +01:00
	/*
	 * We may have the reclaim task running and relocating a data block group,
	 * in which case it may create delayed iputs. So stop it before we park
	 * the cleaner kthread, otherwise we can get new delayed iputs after
	 * parking the cleaner, and that can make the async reclaim task hang
	 * if it's waiting for delayed iputs to complete, since the cleaner is
	 * parked and cannot run delayed iputs - this will make us hang when
	 * trying to stop the async reclaim task.
	 */
	cancel_work_sync(&fs_info->reclaim_bgs_work);
Btrfs: fix missing delayed iputs on unmount
There's a race between close_ctree() and cleaner_kthread().
close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
sees it set, but this is racy; the cleaner might have already checked
the bit and could be cleaning stuff. In particular, if it deletes unused
block groups, it will create delayed iputs for the free space cache
inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
longer running delayed iputs after a commit. Therefore, if the cleaner
creates more delayed iputs after delayed iputs are run in
btrfs_commit_super(), we will leak inodes on unmount and get a busy
inode crash from the VFS.
Fix it by parking the cleaner before we actually close anything. Then,
any remaining delayed iputs will always be handled in
btrfs_commit_super(). This also ensures that the commit in close_ctree()
is really the last commit, so we can get rid of the commit in
cleaner_kthread().
The fstest/generic/475 followed by 476 can trigger a crash that
manifests as a slab corruption caused by accessing the freed kthread
structure by a wake up function. Sample trace:
[ 5657.077612] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
[ 5657.079432] PGD 1c57a067 P4D 1c57a067 PUD da10067 PMD 0
[ 5657.080661] Oops: 0000 [#1] PREEMPT SMP
[ 5657.081592] CPU: 1 PID: 5157 Comm: fsstress Tainted: G W 4.19.0-rc8-default+ #323
[ 5657.083703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
[ 5657.086577] RIP: 0010:shrink_page_list+0x2f9/0xe90
[ 5657.091937] RSP: 0018:ffffb5c745c8f728 EFLAGS: 00010287
[ 5657.092953] RAX: 0000000000000074 RBX: ffffb5c745c8f830 RCX: 0000000000000000
[ 5657.094590] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a8747fdf3d0
[ 5657.095987] RBP: ffffb5c745c8f9e0 R08: 0000000000000000 R09: 0000000000000000
[ 5657.097159] R10: ffff9a8747fdf5e8 R11: 0000000000000000 R12: ffffb5c745c8f788
[ 5657.098513] R13: ffff9a877f6ff2c0 R14: ffff9a877f6ff2c8 R15: dead000000000200
[ 5657.099689] FS: 00007f948d853b80(0000) GS:ffff9a877d600000(0000) knlGS:0000000000000000
[ 5657.101032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5657.101953] CR2: 00000000000000cc CR3: 00000000684bd000 CR4: 00000000000006e0
[ 5657.103159] Call Trace:
[ 5657.103776] shrink_inactive_list+0x194/0x410
[ 5657.104671] shrink_node_memcg.constprop.84+0x39a/0x6a0
[ 5657.105750] shrink_node+0x62/0x1c0
[ 5657.106529] try_to_free_pages+0x1a4/0x500
[ 5657.107408] __alloc_pages_slowpath+0x2c9/0xb20
[ 5657.108418] __alloc_pages_nodemask+0x268/0x2b0
[ 5657.109348] kmalloc_large_node+0x37/0x90
[ 5657.110205] __kmalloc_node+0x236/0x310
[ 5657.111014] kvmalloc_node+0x3e/0x70
Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add trace ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-31 10:06:08 -07:00
	/*
	 * We don't want the cleaner to start new transactions, add more delayed
	 * iputs, etc. while we're closing. We can't use kthread_stop() yet
	 * because that frees the task_struct, and the transaction kthread might
	 * still try to wake up the cleaner.
	 */
	kthread_park(fs_info->cleaner_kthread);
2008-11-12 14:34:12 -05:00
2015-11-04 15:56:16 -08:00
	/* wait for the qgroup rescan worker to stop */
2016-08-08 22:08:06 -04:00
	btrfs_qgroup_wait_for_completion(fs_info, false);
2015-11-04 15:56:16 -08:00
2013-08-15 17:11:21 +02:00
	/* wait for the uuid_scan task to finish */
	down(&fs_info->uuid_tree_rescan_sem);
	/* avoid complaints from lockdep et al., set sem back to initial state */
	up(&fs_info->uuid_tree_rescan_sem);
2012-01-16 22:04:49 +02:00
	/* pause restriper - we want to resume on mount */
2012-11-05 17:03:39 +01:00
	btrfs_pause_balance(fs_info);
2012-01-16 22:04:49 +02:00
2012-11-06 13:15:27 +01:00
	btrfs_dev_replace_suspend_for_unmount(fs_info);
2012-11-05 17:03:39 +01:00
	btrfs_scrub_cancel(fs_info);
2011-05-24 15:35:30 -04:00
	/* wait for any defraggers to finish */
	wait_event(fs_info->transaction_wait,
		   (atomic_read(&fs_info->defrag_running) == 0));
	/* clear out the rbtree of defraggable inodes */
2012-11-26 09:26:20 +00:00
	btrfs_cleanup_defrag_inodes(fs_info);
2011-05-24 15:35:30 -04:00
btrfs: fix hang during unmount when stopping a space reclaim worker
Often when running generic/562 from fstests we can hang during unmount,
resulting in a trace like this:
Sep 07 11:52:00 debian9 unknown: run fstests generic/562 at 2022-09-07 11:52:00
Sep 07 11:55:32 debian9 kernel: INFO: task umount:49438 blocked for more than 120 seconds.
Sep 07 11:55:32 debian9 kernel: Not tainted 6.0.0-rc2-btrfs-next-122 #1
Sep 07 11:55:32 debian9 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 07 11:55:32 debian9 kernel: task:umount state:D stack: 0 pid:49438 ppid: 25683 flags:0x00004000
Sep 07 11:55:32 debian9 kernel: Call Trace:
Sep 07 11:55:32 debian9 kernel: <TASK>
Sep 07 11:55:32 debian9 kernel: __schedule+0x3c8/0xec0
Sep 07 11:55:32 debian9 kernel: ? rcu_read_lock_sched_held+0x12/0x70
Sep 07 11:55:32 debian9 kernel: schedule+0x5d/0xf0
Sep 07 11:55:32 debian9 kernel: schedule_timeout+0xf1/0x130
Sep 07 11:55:32 debian9 kernel: ? lock_release+0x224/0x4a0
Sep 07 11:55:32 debian9 kernel: ? lock_acquired+0x1a0/0x420
Sep 07 11:55:32 debian9 kernel: ? trace_hardirqs_on+0x2c/0xd0
Sep 07 11:55:32 debian9 kernel: __wait_for_common+0xac/0x200
Sep 07 11:55:32 debian9 kernel: ? usleep_range_state+0xb0/0xb0
Sep 07 11:55:32 debian9 kernel: __flush_work+0x26d/0x530
Sep 07 11:55:32 debian9 kernel: ? flush_workqueue_prep_pwqs+0x140/0x140
Sep 07 11:55:32 debian9 kernel: ? trace_clock_local+0xc/0x30
Sep 07 11:55:32 debian9 kernel: __cancel_work_timer+0x11f/0x1b0
Sep 07 11:55:32 debian9 kernel: ? close_ctree+0x12b/0x5b3 [btrfs]
Sep 07 11:55:32 debian9 kernel: ? __trace_bputs+0x10b/0x170
Sep 07 11:55:32 debian9 kernel: close_ctree+0x152/0x5b3 [btrfs]
Sep 07 11:55:32 debian9 kernel: ? evict_inodes+0x166/0x1c0
Sep 07 11:55:32 debian9 kernel: generic_shutdown_super+0x71/0x120
Sep 07 11:55:32 debian9 kernel: kill_anon_super+0x14/0x30
Sep 07 11:55:32 debian9 kernel: btrfs_kill_super+0x12/0x20 [btrfs]
Sep 07 11:55:32 debian9 kernel: deactivate_locked_super+0x2e/0xa0
Sep 07 11:55:32 debian9 kernel: cleanup_mnt+0x100/0x160
Sep 07 11:55:32 debian9 kernel: task_work_run+0x59/0xa0
Sep 07 11:55:32 debian9 kernel: exit_to_user_mode_prepare+0x1a6/0x1b0
Sep 07 11:55:32 debian9 kernel: syscall_exit_to_user_mode+0x16/0x40
Sep 07 11:55:32 debian9 kernel: do_syscall_64+0x48/0x90
Sep 07 11:55:32 debian9 kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Sep 07 11:55:32 debian9 kernel: RIP: 0033:0x7fcde59a57a7
Sep 07 11:55:32 debian9 kernel: RSP: 002b:00007ffe914217c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
Sep 07 11:55:32 debian9 kernel: RAX: 0000000000000000 RBX: 00007fcde5ae8264 RCX: 00007fcde59a57a7
Sep 07 11:55:32 debian9 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055b57556cdd0
Sep 07 11:55:32 debian9 kernel: RBP: 000055b57556cba0 R08: 0000000000000000 R09: 00007ffe91420570
Sep 07 11:55:32 debian9 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
Sep 07 11:55:32 debian9 kernel: R13: 000055b57556cdd0 R14: 000055b57556ccb8 R15: 0000000000000000
Sep 07 11:55:32 debian9 kernel: </TASK>
What happens is the following:
1) The cleaner kthread tries to start a transaction to delete an unused
block group, but the metadata reservation can not be satisfied right
away, so a reservation ticket is created and it starts the async
metadata reclaim task (fs_info->async_reclaim_work);
2) Writeback for all the filler inodes with an i_size of 2K starts
(generic/562 creates a lot of 2K files with the goal of filling
metadata space). We try to create an inline extent for them, but we
fail when trying to insert the inline extent with -ENOSPC (at
cow_file_range_inline()) - since this is not critical, we fall back
to non-inline mode (back to cow_file_range()), reserve extents, create
extent maps and create the ordered extents;
3) An unmount starts, enters close_ctree();
4) The async reclaim task is flushing stuff, entering the flush states one
by one, until it reaches RUN_DELAYED_IPUTS. There it runs all current
delayed iputs.
After running the delayed iputs and before calling
btrfs_wait_on_delayed_iputs(), one or more ordered extents complete,
and btrfs_add_delayed_iput() is called for each one through
btrfs_finish_ordered_io() -> btrfs_put_ordered_extent(). This results
in bumping fs_info->nr_delayed_iputs from 0 to some positive value.
So the async reclaim task blocks at btrfs_wait_on_delayed_iputs() waiting
for fs_info->nr_delayed_iputs to become 0;
5) The current transaction is committed by the transaction kthread, we then
start unpinning extents and end up calling btrfs_try_granting_tickets()
through unpin_extent_range(), since we released some space.
This results in satisfying the ticket created by the cleaner kthread at
step 1, waking up the cleaner kthread;
6) At close_ctree() we ask the cleaner kthread to park;
7) The cleaner kthread starts the transaction, deletes the unused block
group, and then calls kthread_should_park(), which returns true, so it
parks. And at this point we have the delayed iputs added by the
completion of the ordered extents still pending;
8) Then later at close_ctree(), when we call:
cancel_work_sync(&fs_info->async_reclaim_work);
We hang forever, since the cleaner was parked and no one else can run
delayed iputs after that, while the reclaim task is waiting for the
remaining delayed iputs to be completed.
Fix this by waiting for all ordered extents to complete and running the
delayed iputs before attempting to stop the async reclaim tasks. Note that
we can not wait for ordered extents with btrfs_wait_ordered_roots() (or
other similar functions) because that waits for the BTRFS_ORDERED_COMPLETE
flag to be set on an ordered extent, but the delayed iput is added after
that, when doing the final btrfs_put_ordered_extent(). So instead wait for
the work queues used for executing ordered extent completion to be empty,
which works because we do the final put on an ordered extent at
btrfs_finish_ordered_io() (while we are in the unmount context).
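The ordering problem described above can be sketched as a small user-space model. This is only an illustration under stated assumptions: the struct, its fields and all function names are hypothetical stand-ins, not kernel code, and the model assumes the reclaim task is already sitting in the RUN_DELAYED_IPUTS state waiting for the counter to reach 0.

```c
#include <stdbool.h>

/* Toy state; every field is a hypothetical stand-in for kernel state. */
struct unmount_model {
	int inflight_ordered;	/* ordered extents whose completion is still queued */
	int nr_delayed_iputs;	/* delayed iputs that nobody has run yet */
	bool cleaner_parked;	/* once parked, nothing runs iputs in the background */
};

/* Completing an ordered extent adds one delayed iput (step 4 above). */
static void complete_ordered_extents(struct unmount_model *m)
{
	m->nr_delayed_iputs += m->inflight_ordered;
	m->inflight_ordered = 0;
}

static void run_delayed_iputs(struct unmount_model *m)
{
	m->nr_delayed_iputs = 0;
}

/*
 * The reclaim task waits for nr_delayed_iputs to become 0. If the cleaner
 * is parked and iputs are pending, cancelling the work never returns.
 */
static bool cancel_reclaim_would_hang(const struct unmount_model *m)
{
	return m->cleaner_parked && m->nr_delayed_iputs > 0;
}

/* Old order: park the cleaner, let completions land, then cancel. */
static bool old_order_hangs(struct unmount_model m)
{
	m.cleaner_parked = true;
	complete_ordered_extents(&m);
	return cancel_reclaim_would_hang(&m);
}

/* Fixed order: wait for completions, run the iputs, then cancel. */
static bool new_order_hangs(struct unmount_model m)
{
	m.cleaner_parked = true;
	complete_ordered_extents(&m);
	run_delayed_iputs(&m);
	return cancel_reclaim_would_hang(&m);
}
```

With a few ordered extents in flight, old_order_hangs() reports a hang while new_order_hangs() does not, which mirrors the reasoning of the fix. The model takes the struct by value so that each ordering starts from the same initial state.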
Fixes: d6fd0ae25c6495 ("Btrfs: fix missing delayed iputs on unmount")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-08 12:31:51 +01:00
/*
* After we parked the cleaner kthread , ordered extents may have
* completed and created new delayed iputs . If one of the async reclaim
* tasks is running and in the RUN_DELAYED_IPUTS flush state , then we
* can hang forever trying to stop it , because if a delayed iput is
* added after it ran btrfs_run_delayed_iputs ( ) and before it called
* btrfs_wait_on_delayed_iputs ( ) , it will hang forever since there is
* no one else to run iputs .
*
* So wait for all ongoing ordered extents to complete and then run
* delayed iputs . This works because once we reach this point no one
* can create new ordered extents or create delayed iputs
* through some other means .
*
* Also note that btrfs_wait_ordered_roots ( ) is not safe here , because
* it waits for BTRFS_ORDERED_COMPLETE to be set on an ordered extent ,
* but the delayed iput for the respective inode is made only when doing
* the final btrfs_put_ordered_extent ( ) ( which must happen at
* btrfs_finish_ordered_io ( ) when we are unmounting ) .
*/
btrfs_flush_workqueue ( fs_info - > endio_write_workers ) ;
/* Ordered extents for free space inodes. */
btrfs_flush_workqueue ( fs_info - > endio_freespace_worker ) ;
btrfs_run_delayed_iputs ( fs_info ) ;
Btrfs: reclaim the reserved metadata space at background
Before applying this patch, the task had to reclaim the metadata space
by itself if the metadata space was not enough. And when the task started
the space reclamation, all the other tasks which wanted to reserve the
metadata space were blocked. In some cases, they would be blocked for
a long time, which made the performance fluctuate wildly.
So we introduce background metadata space reclamation: when the space
is about to be exhausted, we insert a reclaim work into the workqueue, and the
worker of the workqueue helps us to reclaim the reserved space in the
background. This way, the tasks needn't reclaim the space by themselves in
most cases, and even if the tasks have to reclaim the space or are blocked
for the space reclamation, they will get enough space more quickly.
Here is my test result (tested with compilebench):
Memory: 2GB
CPU: 2Cores * 1CPU
Partition: 40GB(SSD)
Test command:
# compilebench -D <mnt> -m
Without this patch:
intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
With this patch:
intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-05-13 17:29:04 -07:00
cancel_work_sync ( & fs_info - > async_reclaim_work ) ;
2020-07-21 10:22:33 -04:00
cancel_work_sync ( & fs_info - > async_data_reclaim_work ) ;
btrfs: improve preemptive background space flushing
Currently if we ever have to flush space because we do not have enough
we allocate a ticket and attach it to the space_info, and then
systematically flush things in the filesystem that hold space
reservations until our space is reclaimed.
However this has a latency cost, we must go to sleep and wait for the
flushing to make progress before we are woken up and allowed to continue
doing our work.
In order to address that we used to kick off the async worker to flush
space preemptively, so that we could be reclaiming space hopefully
before any tasks needed to stop and wait for space to reclaim.
When I introduced the ticketed ENOSPC stuff this broke slightly in the
fact that we were using tickets to indicate if we were done flushing.
No tickets, no more flushing. However this meant that we essentially
never preemptively flushed. This caused a write performance regression
that Nikolay noticed in an unrelated patch that removed the committing
of the transaction during btrfs_end_transaction.
The behavior that happened pre that patch was btrfs_end_transaction()
would see that we were low on space, and it would commit the
transaction. This was bad because in this particular case you could end
up with thousands and thousands of transactions being committed during
the 5 minute reproducer. With the patch to remove this behavior we got
much more sane transaction commits, but we ended up slower because we
would write for a while, flush, write for a while, flush again.
To address this we need to reinstate a preemptive flushing mechanism.
However it is distinctly different from our ticketing flushing in that
it doesn't have tickets to base its decisions on. Instead of bolting
this logic into our existing flushing work, add another worker to handle
this preemptive flushing. Here we will attempt to be slightly
intelligent about the things that we flush, attempting to balance
between whichever pool is taking up the most space.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-09 09:28:22 -04:00
cancel_work_sync ( & fs_info - > preempt_reclaim_work ) ;
2019-12-13 16:22:14 -08:00
/* Cancel or finish ongoing discard work */
btrfs_discard_cleanup ( fs_info ) ;
2017-07-17 08:45:34 +01:00
if ( ! sb_rdonly ( fs_info - > sb ) ) {
2015-06-15 09:41:18 -04:00
/*
Btrfs: fix missing delayed iputs on unmount
There's a race between close_ctree() and cleaner_kthread().
close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
sees it set, but this is racy; the cleaner might have already checked
the bit and could be cleaning stuff. In particular, if it deletes unused
block groups, it will create delayed iputs for the free space cache
inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
longer running delayed iputs after a commit. Therefore, if the cleaner
creates more delayed iputs after delayed iputs are run in
btrfs_commit_super(), we will leak inodes on unmount and get a busy
inode crash from the VFS.
Fix it by parking the cleaner before we actually close anything. Then,
any remaining delayed iputs will always be handled in
btrfs_commit_super(). This also ensures that the commit in close_ctree()
is really the last commit, so we can get rid of the commit in
cleaner_kthread().
The fstest/generic/475 followed by 476 can trigger a crash that
manifests as a slab corruption caused by accessing the freed kthread
structure by a wake up function. Sample trace:
[ 5657.077612] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
[ 5657.079432] PGD 1c57a067 P4D 1c57a067 PUD da10067 PMD 0
[ 5657.080661] Oops: 0000 [#1] PREEMPT SMP
[ 5657.081592] CPU: 1 PID: 5157 Comm: fsstress Tainted: G W 4.19.0-rc8-default+ #323
[ 5657.083703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
[ 5657.086577] RIP: 0010:shrink_page_list+0x2f9/0xe90
[ 5657.091937] RSP: 0018:ffffb5c745c8f728 EFLAGS: 00010287
[ 5657.092953] RAX: 0000000000000074 RBX: ffffb5c745c8f830 RCX: 0000000000000000
[ 5657.094590] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a8747fdf3d0
[ 5657.095987] RBP: ffffb5c745c8f9e0 R08: 0000000000000000 R09: 0000000000000000
[ 5657.097159] R10: ffff9a8747fdf5e8 R11: 0000000000000000 R12: ffffb5c745c8f788
[ 5657.098513] R13: ffff9a877f6ff2c0 R14: ffff9a877f6ff2c8 R15: dead000000000200
[ 5657.099689] FS: 00007f948d853b80(0000) GS:ffff9a877d600000(0000) knlGS:0000000000000000
[ 5657.101032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5657.101953] CR2: 00000000000000cc CR3: 00000000684bd000 CR4: 00000000000006e0
[ 5657.103159] Call Trace:
[ 5657.103776] shrink_inactive_list+0x194/0x410
[ 5657.104671] shrink_node_memcg.constprop.84+0x39a/0x6a0
[ 5657.105750] shrink_node+0x62/0x1c0
[ 5657.106529] try_to_free_pages+0x1a4/0x500
[ 5657.107408] __alloc_pages_slowpath+0x2c9/0xb20
[ 5657.108418] __alloc_pages_nodemask+0x268/0x2b0
[ 5657.109348] kmalloc_large_node+0x37/0x90
[ 5657.110205] __kmalloc_node+0x236/0x310
[ 5657.111014] kvmalloc_node+0x3e/0x70
Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add trace ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-31 10:06:08 -07:00
* The cleaner kthread is stopped , so do one final pass over
* unused block groups .
2015-06-15 09:41:18 -04:00
*/
2016-06-22 18:54:23 -04:00
btrfs_delete_unused_bgs ( fs_info ) ;
2015-06-15 09:41:18 -04:00
Btrfs: fix crash during unmount due to race with delayed inode workers
During unmount we can have a job from the delayed inode items work queue
still running, that can lead to at least two bad things:
1) A crash, because the worker can try to create a transaction just
after the fs roots were freed;
2) A transaction leak, because the worker can create a transaction
before the fs roots are freed and just after we committed the last
transaction and after we stopped the transaction kthread.
A stack trace example of the crash:
[79011.691214] kernel BUG at lib/radix-tree.c:982!
[79011.692056] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[79011.693180] CPU: 3 PID: 1394 Comm: kworker/u8:2 Tainted: G W 5.6.0-rc2-btrfs-next-54 #2
(...)
[79011.696789] Workqueue: btrfs-delayed-meta btrfs_work_helper [btrfs]
[79011.697904] RIP: 0010:radix_tree_tag_set+0xe7/0x170
(...)
[79011.702014] RSP: 0018:ffffb3c84a317ca0 EFLAGS: 00010293
[79011.702949] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[79011.704202] RDX: ffffb3c84a317cb0 RSI: ffffb3c84a317ca8 RDI: ffff8db3931340a0
[79011.705463] RBP: 0000000000000005 R08: 0000000000000005 R09: ffffffff974629d0
[79011.706756] R10: ffffb3c84a317bc0 R11: 0000000000000001 R12: ffff8db393134000
[79011.708010] R13: ffff8db3931340a0 R14: ffff8db393134068 R15: 0000000000000001
[79011.709270] FS: 0000000000000000(0000) GS:ffff8db3b6a00000(0000) knlGS:0000000000000000
[79011.710699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[79011.711710] CR2: 00007f22c2a0a000 CR3: 0000000232ad4005 CR4: 00000000003606e0
[79011.712958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[79011.714205] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[79011.715448] Call Trace:
[79011.715925] record_root_in_trans+0x72/0xf0 [btrfs]
[79011.716819] btrfs_record_root_in_trans+0x4b/0x70 [btrfs]
[79011.717925] start_transaction+0xdd/0x5c0 [btrfs]
[79011.718829] btrfs_async_run_delayed_root+0x17e/0x2b0 [btrfs]
[79011.719915] btrfs_work_helper+0xaa/0x720 [btrfs]
[79011.720773] process_one_work+0x26d/0x6a0
[79011.721497] worker_thread+0x4f/0x3e0
[79011.722153] ? process_one_work+0x6a0/0x6a0
[79011.722901] kthread+0x103/0x140
[79011.723481] ? kthread_create_worker_on_cpu+0x70/0x70
[79011.724379] ret_from_fork+0x3a/0x50
(...)
The following diagram shows a sequence of steps that lead to the crash
during unmount of the filesystem:
        CPU 1                                   CPU 2                                   CPU 3

 btrfs_punch_hole()
   btrfs_btree_balance_dirty()
     btrfs_balance_delayed_items()
       --> sees
           fs_info->delayed_root->items
           with value 200, which is greater
           than
           BTRFS_DELAYED_BACKGROUND (128)
           and smaller than
           BTRFS_DELAYED_WRITEBACK (512)
       btrfs_wq_run_delayed_node()
         --> queues a job for
             fs_info->delayed_workers to run
             btrfs_async_run_delayed_root()

                                                                                btrfs_async_run_delayed_root()
                                                                                  --> job queued by CPU 1

                                                                                  --> starts picking and running
                                                                                      delayed nodes from the
                                                                                      prepare_list list

                                        close_ctree()

                                          btrfs_delete_unused_bgs()

                                          btrfs_commit_super()

                                            btrfs_join_transaction()
                                              --> gets transaction N

                                            btrfs_commit_transaction(N)
                                              --> set transaction state
                                                  to TRANS_STATE_COMMIT_START

                                                                                btrfs_first_prepared_delayed_node()
                                                                                  --> picks delayed node X through
                                                                                      the prepare_list list

                                              btrfs_run_delayed_items()

                                                btrfs_first_delayed_node()
                                                  --> also picks delayed node X
                                                      but through the node_list
                                                      list

                                                __btrfs_commit_inode_delayed_items()
                                                  --> runs all delayed items from
                                                      this node and drops the
                                                      node's item count to 0
                                                      through call to
                                                      btrfs_release_delayed_inode()

                                                --> finishes running any remaining
                                                    delayed nodes
                                                --> finishes transaction commit

                                          --> stops cleaner and transaction threads

                                          btrfs_free_fs_roots()
                                            --> frees all roots and removes them
                                                from the radix tree
                                                fs_info->fs_roots_radix

                                                                                btrfs_join_transaction()
                                                                                  start_transaction()
                                                                                    btrfs_record_root_in_trans()
                                                                                      record_root_in_trans()
                                                                                        radix_tree_tag_set()
                                                                                          --> crashes because
                                                                                              the root is not in
                                                                                              the radix tree
                                                                                              anymore
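The threshold check that CPU 1 performs in the diagram can be sketched as follows. This is a minimal illustration, not the kernel function: the macro values are the ones quoted in the diagram, while the enum, the function name and the exact handling of the boundary values (whether a count equal to a threshold falls above or below it) are assumptions.

```c
#include <assert.h>

/* Threshold values as quoted in the diagram; the names are shortened here. */
#define DELAYED_BACKGROUND 128
#define DELAYED_WRITEBACK  512

static_assert(DELAYED_BACKGROUND < DELAYED_WRITEBACK,
	      "background threshold must be below writeback threshold");

/* Hypothetical result type for the balance decision. */
enum balance_action {
	BALANCE_NONE,		/* too few delayed items, do nothing */
	BALANCE_ASYNC,		/* queue a worker job and return */
	BALANCE_ASYNC_AND_WAIT,	/* queue worker jobs and wait for progress */
};

/* Assumed boundary handling: >= for the upper bound, < for the lower one. */
static enum balance_action classify_delayed_items(int items)
{
	if (items < DELAYED_BACKGROUND)
		return BALANCE_NONE;
	if (items >= DELAYED_WRITEBACK)
		return BALANCE_ASYNC_AND_WAIT;
	return BALANCE_ASYNC;
}
```

For the value 200 used in the diagram this yields BALANCE_ASYNC, which is the case where a job is queued for the delayed workers and the caller does not wait for it, leaving the worker free to race with the unmount.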
If the worker is able to call btrfs_join_transaction() before the unmount
task frees the fs roots, we end up leaking a transaction and all its
resources, since after the call to btrfs_commit_super() and stopping the
transaction kthread, we don't expect to have any transaction open anymore.
When this situation happens the worker has a delayed node that has no
more items to run, since the task calling btrfs_run_delayed_items(),
which is doing a transaction commit, picks the same node and runs all
its items first.
We can not wait for the worker to complete when running delayed items
through btrfs_run_delayed_items(), because we call that function in
several phases of a transaction commit, and that could cause a deadlock
because the worker calls btrfs_join_transaction() and the task doing the
transaction commit may have already set the transaction state to
TRANS_STATE_COMMIT_DOING.
Also it's not possible to get into a situation where only some of the
items of a delayed node are added to the fs/subvolume tree in the current
transaction and the remaining ones in the next transaction, because when
running the items of a delayed inode we lock its mutex, effectively
waiting for the worker if the worker is running the items of the delayed
node already.
Since this can only cause issues when unmounting a filesystem, fix it in
a simple way by waiting for any jobs on the delayed workers queue before
calling btrfs_commit_super() at close_ctree(). This works because at this
point no one can call btrfs_btree_balance_dirty() or
btrfs_balance_delayed_items(), and if we end up waiting for any worker to
complete, btrfs_commit_super() will commit the transaction created by the
worker.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-28 13:04:36 +00:00
/*
* There might be existing delayed inode workers still running
* and holding an empty delayed inode item . We must wait for
* them to complete first because they can create a transaction .
* This happens when someone calls btrfs_balance_delayed_items ( )
* and then a transaction commit runs the same delayed nodes
* before any delayed worker has done something with the nodes .
* We must wait for any worker here and not at transaction
* commit time since that could cause a deadlock .
* This is a very rare case .
*/
btrfs_flush_workqueue ( fs_info - > delayed_workers ) ;
2016-06-21 21:16:51 -04:00
ret = btrfs_commit_super ( fs_info ) ;
2011-01-06 19:30:25 +08:00
if ( ret )
2014-08-01 18:12:36 -05:00
btrfs_err ( fs_info , " commit super ret %d " , ret ) ;
2011-01-06 19:30:25 +08:00
}
2021-10-05 16:35:25 -04:00
if ( BTRFS_FS_ERROR ( fs_info ) )
2016-06-22 18:54:24 -04:00
btrfs_error_commit_super ( fs_info ) ;
2007-04-09 10:42:37 -04:00
2011-11-17 00:56:18 -05:00
kthread_stop ( fs_info - > transaction_kthread ) ;
kthread_stop ( fs_info - > cleaner_kthread ) ;
2010-05-16 10:49:58 -04:00
2018-09-28 07:18:03 -04:00
ASSERT ( list_empty ( & fs_info - > delayed_iputs ) ) ;
2016-09-02 15:40:02 -04:00
set_bit ( BTRFS_FS_CLOSING_DONE , & fs_info - > flags ) ;
2009-07-28 08:41:57 -04:00
2020-06-10 09:04:44 +08:00
if ( btrfs_check_quota_leak ( fs_info ) ) {
WARN_ON ( IS_ENABLED ( CONFIG_BTRFS_DEBUG ) ) ;
btrfs_err ( fs_info , " qgroup reserved space leaked " ) ;
}
2014-08-01 18:12:36 -05:00
btrfs_free_qgroup_config ( fs_info ) ;
2018-04-27 12:21:53 +03:00
ASSERT ( list_empty ( & fs_info - > delalloc_roots ) ) ;
2011-09-13 15:23:30 +02:00
2013-01-29 10:10:51 +00:00
if ( percpu_counter_sum ( & fs_info - > delalloc_bytes ) ) {
2014-08-01 18:12:36 -05:00
btrfs_info ( fs_info , " at unmount delalloc count %lld " ,
2013-01-29 10:10:51 +00:00
percpu_counter_sum ( & fs_info - > delalloc_bytes ) ) ;
2008-01-31 11:05:37 -05:00
}
2008-07-30 16:29:20 -04:00
2020-10-09 09:28:20 -04:00
if ( percpu_counter_sum ( & fs_info - > ordered_bytes ) )
2019-04-10 15:56:09 -04:00
btrfs_info ( fs_info , " at unmount dio bytes count %lld " ,
2020-10-09 09:28:20 -04:00
percpu_counter_sum ( & fs_info - > ordered_bytes ) ) ;
2019-04-10 15:56:09 -04:00
2015-08-14 18:32:47 +08:00
btrfs_sysfs_remove_mounted ( fs_info ) ;
2015-03-10 06:38:38 +08:00
btrfs_sysfs_remove_fsid ( fs_info - > fs_devices ) ;
2013-11-01 13:06:58 -04:00
2014-01-13 19:53:53 +08:00
btrfs_put_block_group_cache ( fs_info ) ;
2014-04-09 19:23:22 +08:00
/*
* We must make sure there are no read requests to
* submit after we stop all workers .
*/
invalidate_inode_pages2 ( fs_info - > btree_inode - > i_mapping ) ;
2013-10-16 13:53:28 -04:00
btrfs_stop_all_workers ( fs_info ) ;
2020-12-14 10:10:48 +00:00
/* We shouldn't have any transaction open at this point */
2021-12-16 19:47:36 +08:00
warn_about_uncommitted_trans ( fs_info ) ;
2020-12-14 10:10:48 +00:00
2016-09-02 15:40:02 -04:00
clear_bit ( BTRFS_FS_OPEN , & fs_info - > flags ) ;
2019-10-10 10:39:25 +08:00
free_root_pointers ( fs_info , true ) ;
2020-02-14 16:11:42 -05:00
btrfs_free_fs_roots ( fs_info ) ;
2008-04-18 16:11:30 -04:00
2020-01-21 09:17:06 -05:00
/*
* We must free the block groups after dropping the fs_roots as we could
* have had an IO error and have left over tree log blocks that aren ' t
* cleaned up until the fs roots are freed . This makes the block group
* accounting appear to be wrong because there ' s pending reserved bytes ,
* so make sure we do the block group cleanup afterwards .
*/
btrfs_free_block_groups ( fs_info ) ;
2013-05-30 16:55:44 -04:00
iput ( fs_info - > btree_inode ) ;
2008-04-30 13:59:35 -04:00
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extent maps per 4K
page instead of 28.
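The per-page figures above follow from integer division of the page size by the structure size. A minimal sketch of that arithmetic, where objs_per_page() is a hypothetical helper and not a kernel function:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many objects of obj_size bytes fit in one page. */
static size_t objs_per_page(size_t page_size, size_t obj_size)
{
	assert(obj_size > 0 && obj_size <= page_size);
	/* Integer division drops the partial object at the end of the page. */
	return page_size / obj_size;
}
```

Calling objs_per_page(4096, 144) gives 28 and objs_per_page(4096, 136) gives 30, matching the numbers quoted in the message. This ignores any per-slab bookkeeping the allocator may add.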
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
btrfs_mapping_tree_free ( fs_info ) ;
2019-02-12 16:13:14 +02:00
btrfs_close_devices ( fs_info - > fs_devices ) ;
2007-02-02 09:18:22 -05:00
}
2023-09-12 13:04:29 +01:00
void btrfs_mark_buffer_dirty ( struct btrfs_trans_handle * trans ,
struct extent_buffer * buf )
2007-10-15 16:14:19 -04:00
{
2020-11-03 21:30:46 +08:00
struct btrfs_fs_info * fs_info = buf - > fs_info ;
2007-10-15 16:14:19 -04:00
u64 transid = btrfs_header_generation ( buf ) ;
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop;
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able to get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
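The state transitions described above can be modelled in a few lines. This is a single-threaded sketch of the states only, under loud assumptions: it does no real locking, waiting or adaptive spinning, and the struct and every model_* function are hypothetical names, not the btrfs implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one extent buffer lock. */
struct eb_lock_model {
	bool spin_held;	/* the spinlock is currently held */
	bool blocking;	/* stands in for the EXTENT_BUFFER_BLOCKING bit */
};

/* True if some holder owns the buffer, in spinning or blocking mode. */
static bool model_is_locked(const struct eb_lock_model *l)
{
	return l->spin_held || l->blocking;
}

/* Returns false where the real code would have to spin or sleep. */
static bool model_tree_lock(struct eb_lock_model *l)
{
	if (model_is_locked(l))
		return false;
	l->spin_held = true;
	return true;
}

/* Set the blocking bit, then drop the spinlock; the buffer stays locked. */
static void model_set_lock_blocking(struct eb_lock_model *l)
{
	assert(l->spin_held);
	l->blocking = true;
	l->spin_held = false;
}

/* Clear the blocking bit and return with the spinlock held again. */
static void model_clear_lock_blocking(struct eb_lock_model *l)
{
	assert(l->blocking);
	l->spin_held = true;
	l->blocking = false;
}

/* Unlock works for either mode, decided by the blocking bit. */
static void model_tree_unlock(struct eb_lock_model *l)
{
	assert(model_is_locked(l));
	if (l->blocking)
		l->blocking = false;
	else
		l->spin_held = false;
}
```

The point of the model is that a buffer in blocking mode still counts as locked even though the spinlock itself has been dropped, so a second model_tree_lock() attempt fails until the holder unlocks.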
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:25:08 -05:00
2013-09-19 16:07:01 -04:00
# ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
/*
* This is a fast path so only do this check if we have sanity tests
2018-11-28 12:05:13 +01:00
* enabled . Normal people shouldn ' t be using unmapped buffers as dirty
2013-09-19 16:07:01 -04:00
* outside of the sanity tests .
*/
2018-06-27 16:38:24 +03:00
if ( unlikely ( test_bit ( EXTENT_BUFFER_UNMAPPED , & buf - > bflags ) ) )
2013-09-19 16:07:01 -04:00
return ;
# endif
2023-09-12 13:04:29 +01:00
/* This is an active transaction (its state < TRANS_STATE_UNBLOCKED). */
ASSERT ( trans - > transid = = fs_info - > generation ) ;
2021-09-22 10:36:45 +01:00
btrfs_assert_tree_write_locked ( buf ) ;
2023-09-12 13:04:31 +01:00
if ( unlikely ( transid ! = fs_info - > generation ) ) {
2023-09-12 13:04:29 +01:00
btrfs_abort_transaction ( trans , - EUCLEAN ) ;
2023-09-12 13:04:30 +01:00
btrfs_crit ( fs_info ,
" dirty buffer transid mismatch, logical %llu found transid %llu running transid %llu " ,
buf - > start , transid , fs_info - > generation ) ;
2023-09-12 13:04:29 +01:00
}
2023-05-08 07:58:38 -07:00
set_extent_buffer_dirty ( buf ) ;
2007-02-02 09:18:22 -05:00
}
2016-06-22 18:54:24 -04:00
static void __btrfs_btree_balance_dirty ( struct btrfs_fs_info * fs_info ,
2012-11-14 14:34:34 +00:00
int flush_delayed )
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comments.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that uses two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the number of delayed items is beyond the lower limit, we create works for
some delayed nodes and insert them into the work queue of the worker, and then
go back.
When the number of delayed items is beyond the upper bound, we create works for
all the delayed nodes that haven't been dealt with, and insert them into the work
queue of the worker, and then wait until the number of untreated items is below
some threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search for it
in the inserting rb-tree at first. If we find it, just drop it. If not,
add the key of it into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same as for the inserting manipulation.)
- When we want to update the metadata of some inode, we cache the data of the
inode into the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit transaction, we will deal with all the delayed nodes.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
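The insertion and deletion rules in the list above can be sketched in plain C. This is only an illustration, not the code from the patch: the type `delayed_node_sketch`, the helpers `delayed_insert()`, `delayed_delete()` and `delayed_pending()`, and the capacity `MAX_PENDING` are all hypothetical, and flat arrays stand in for the two rb-trees so that the example stays self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 64 /* hypothetical capacity for this sketch */

/* Hypothetical stand-in for a delayed node; arrays replace the two rb-trees. */
struct delayed_node_sketch {
	uint64_t ins[MAX_PENDING]; /* directory index keys waiting to be inserted */
	size_t nr_ins;
	uint64_t del[MAX_PENDING]; /* directory index keys waiting to be deleted */
	size_t nr_del;
};

/* Queue an insertion: only the in-memory set is touched, not the b+ tree. */
static bool delayed_insert(struct delayed_node_sketch *node, uint64_t index)
{
	assert(node != NULL);
	if (node->nr_ins == MAX_PENDING)
		return false;
	node->ins[node->nr_ins++] = index;
	return true;
}

/*
 * Queue a deletion. If the same index is still waiting to be inserted, the
 * two operations cancel out and nothing is queued. Otherwise the key is
 * recorded in the deletion set.
 */
static bool delayed_delete(struct delayed_node_sketch *node, uint64_t index)
{
	assert(node != NULL);
	for (size_t i = 0; i < node->nr_ins; i++) {
		if (node->ins[i] == index) {
			/* Drop the pending insertion by moving the last key here. */
			node->ins[i] = node->ins[--node->nr_ins];
			return true;
		}
	}
	if (node->nr_del == MAX_PENDING)
		return false;
	node->del[node->nr_del++] = index;
	return true;
}

/* Number of delayed items the worker would still have to apply. */
static size_t delayed_pending(const struct delayed_node_sketch *node)
{
	assert(node != NULL);
	return node->nr_ins + node->nr_del;
}
```

Creating and then removing the same directory entry before the worker runs leaves nothing to write to the b+ tree, which is where part of the saving described in this message comes from.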
I did a quick test with the benchmark tool[1] and found that we can improve
the performance of file creation by ~15% and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
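The quoted ~15% and ~20% figures can be checked against the total times listed above. The helper below is a hypothetical name introduced only for this check, and it assumes the percentages are reductions in total elapsed time relative to the run without the patch.

```c
#include <assert.h>

/* Relative reduction in elapsed time, as a percentage of the "before" value. */
static double time_reduction_pct(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

With the totals above, `time_reduction_pct(1.096108, 0.932899)` is about 14.9 for file creation and `time_reduction_pct(1.510403, 1.215732)` is about 19.5 for file deletion, consistent with the rounded figures in the message.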
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
{
	/*
	 * looks as though older kernels can get into trouble with
	 * this code, they end up stuck in balance_dirty_pages forever
	 */
2013-01-29 10:09:20 +00:00
	int ret;
2011-04-22 18:12:22 +08:00
	if (current->flags & PF_MEMALLOC)
		return;
2012-11-14 14:34:34 +00:00
	if (flush_delayed)
2016-06-22 18:54:24 -04:00
		btrfs_balance_delayed_items(fs_info);
2011-04-22 18:12:22 +08:00
2018-07-02 15:44:58 +08:00
	ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
				       BTRFS_DIRTY_METADATA_THRESH,
				       fs_info->dirty_metadata_batch);
2013-01-29 10:09:20 +00:00
	if (ret > 0) {
2016-06-22 18:54:23 -04:00
		balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping);
2011-04-22 18:12:22 +08:00
	}
}
2016-06-22 18:54:24 -04:00
void btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info)
2007-05-02 15:53:43 -04:00
{
2016-06-22 18:54:24 -04:00
	__btrfs_btree_balance_dirty(fs_info, 1);
2012-11-14 14:34:34 +00:00
}
2009-05-18 10:41:58 -04:00
2016-06-22 18:54:24 -04:00
void btrfs_btree_balance_dirty_nodelay(struct btrfs_fs_info *fs_info)
2012-11-14 14:34:34 +00:00
{
2016-06-22 18:54:24 -04:00
	__btrfs_btree_balance_dirty(fs_info, 0);
2007-05-02 15:53:43 -04:00
}
2007-10-15 16:17:34 -04:00
2016-06-22 18:54:24 -04:00
static void btrfs_error_commit_super(struct btrfs_fs_info *fs_info)
2011-01-06 19:30:25 +08:00
{
2018-04-27 12:21:53 +03:00
	/* cleanup FS via transaction */
	btrfs_cleanup_transaction(fs_info);
2016-06-22 18:54:23 -04:00
	mutex_lock(&fs_info->cleaner_mutex);
2016-06-22 18:54:24 -04:00
	btrfs_run_delayed_iputs(fs_info);
2016-06-22 18:54:23 -04:00
	mutex_unlock(&fs_info->cleaner_mutex);
2011-01-06 19:30:25 +08:00
2016-06-22 18:54:23 -04:00
	down_write(&fs_info->cleanup_work_sem);
	up_write(&fs_info->cleanup_work_sem);
2011-01-06 19:30:25 +08:00
}
2020-03-24 10:47:52 -04:00
static void btrfs_drop_all_logs(struct btrfs_fs_info *fs_info)
{
2022-07-15 13:59:21 +02:00
	struct btrfs_root *gang[8];
	u64 root_objectid = 0;
	int ret;
	spin_lock(&fs_info->fs_roots_radix_lock);
	while ((ret = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
					     (void **)gang, root_objectid,
					     ARRAY_SIZE(gang))) != 0) {
		int i;
		for (i = 0; i < ret; i++)
			gang[i] = btrfs_grab_root(gang[i]);
		spin_unlock(&fs_info->fs_roots_radix_lock);
		for (i = 0; i < ret; i++) {
			if (!gang[i])
2020-03-24 10:47:52 -04:00
				continue;
2024-04-15 16:16:23 -04:00
			root_objectid = btrfs_root_id(gang[i]);
2022-07-15 13:59:21 +02:00
			btrfs_free_log(NULL, gang[i]);
			btrfs_put_root(gang[i]);
2020-03-24 10:47:52 -04:00
		}
2022-07-15 13:59:21 +02:00
		root_objectid++;
		spin_lock(&fs_info->fs_roots_radix_lock);
2020-03-24 10:47:52 -04:00
	}
2022-07-15 13:59:21 +02:00
	spin_unlock(&fs_info->fs_roots_radix_lock);
2020-03-24 10:47:52 -04:00
	btrfs_free_log_root_tree(NULL, fs_info);
}
2012-03-01 14:56:26 +01:00
static void btrfs_destroy_ordered_extents(struct btrfs_root *root)
2011-01-06 19:30:25 +08:00
{
	struct btrfs_ordered_extent *ordered;
2013-05-15 07:48:23 +00:00
	spin_lock(&root->ordered_extent_lock);
2013-01-31 14:30:08 -05:00
	/*
	 * This will just short circuit the ordered completion stuff which will
	 * make sure the ordered extent gets properly cleaned up.
	 */
2013-05-15 07:48:23 +00:00
	list_for_each_entry(ordered, &root->ordered_extents,
2013-01-31 14:30:08 -05:00
			    root_extent_list)
		set_bit(BTRFS_ORDERED_IOERR, &ordered->flags);
2013-05-15 07:48:23 +00:00
	spin_unlock(&root->ordered_extent_lock);
}
static void btrfs_destroy_all_ordered_extents(struct btrfs_fs_info *fs_info)
{
	struct btrfs_root *root;
2023-08-10 11:00:22 +08:00
	LIST_HEAD(splice);
2013-05-15 07:48:23 +00:00
	spin_lock(&fs_info->ordered_root_lock);
	list_splice_init(&fs_info->ordered_roots, &splice);
	while (!list_empty(&splice)) {
		root = list_first_entry(&splice, struct btrfs_root,
					ordered_root);
2013-09-27 16:36:02 -04:00
		list_move_tail(&root->ordered_root,
			       &fs_info->ordered_roots);
2013-05-15 07:48:23 +00:00
2014-02-10 17:07:16 +08:00
		spin_unlock(&fs_info->ordered_root_lock);
2013-05-15 07:48:23 +00:00
		btrfs_destroy_ordered_extents(root);
2014-02-10 17:07:16 +08:00
		cond_resched();
		spin_lock(&fs_info->ordered_root_lock);
2013-05-15 07:48:23 +00:00
	}
	spin_unlock(&fs_info->ordered_root_lock);
2018-11-21 14:05:45 -05:00
	/*
	 * We need this here because if we've been flipped read-only we won't
	 * get sync() from the umount, so we need to make sure any ordered
	 * extents that haven't had their dirty pages IO start writeout yet
	 * actually get run and error out properly.
	 */
2024-05-14 16:48:12 +02:00
	btrfs_wait_ordered_roots(fs_info, U64_MAX, NULL);
2011-01-06 19:30:25 +08:00
}
2023-06-02 12:19:42 +01:00
static void btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
				       struct btrfs_fs_info *fs_info)
2011-01-06 19:30:25 +08:00
{
	struct rb_node *node;
2024-06-03 12:49:08 +01:00
	struct btrfs_delayed_ref_root *delayed_refs = &trans->delayed_refs;
2011-01-06 19:30:25 +08:00
	struct btrfs_delayed_ref_node *ref;
	spin_lock(&delayed_refs->lock);
2018-08-23 03:51:49 +08:00
	while ((node = rb_first_cached(&delayed_refs->href_root)) != NULL) {
2014-01-23 09:21:38 -05:00
		struct btrfs_delayed_ref_head *head;
2017-10-19 14:16:00 -04:00
		struct rb_node *n;
2013-06-03 16:42:36 -04:00
		bool pin_bytes = false;
2011-01-06 19:30:25 +08:00
2014-01-23 09:21:38 -05:00
		head = rb_entry(node, struct btrfs_delayed_ref_head,
				href_node);
2018-11-21 14:05:39 -05:00
		if (btrfs_delayed_ref_lock(delayed_refs, head))
2014-01-23 09:21:38 -05:00
			continue;
2018-11-21 14:05:39 -05:00
2014-01-23 09:21:38 -05:00
		spin_lock(&head->lock);
2018-08-23 03:51:50 +08:00
		while ((n = rb_first_cached(&head->ref_tree)) != NULL) {
2017-10-19 14:16:00 -04:00
			ref = rb_entry(n, struct btrfs_delayed_ref_node,
				       ref_node);
2018-08-23 03:51:50 +08:00
			rb_erase_cached(&ref->ref_node, &head->ref_tree);
2017-10-19 14:16:00 -04:00
			RB_CLEAR_NODE(&ref->ref_node);
btrfs: improve delayed refs iterations
This issue was found when I tried to delete a heavily reflinked file.
When deleting such files, other transaction operations will not have a
chance to make progress; for example, start_transaction() will be blocked
in wait_current_trans(root) for a long time, sometimes even triggering
soft lockups. The time taken to delete such a heavily reflinked file
is also very large, often hundreds of seconds. Using perf top, it reports
that:
PerfTop: 7416 irqs/sec kernel:99.8% exact: 0.0% [4000Hz cpu-clock], (all, 4 CPUs)
---------------------------------------------------------------------------------------
84.37% [btrfs] [k] __btrfs_run_delayed_refs.constprop.80
11.02% [kernel] [k] delay_tsc
0.79% [kernel] [k] _raw_spin_unlock_irq
0.78% [kernel] [k] _raw_spin_unlock_irqrestore
0.45% [kernel] [k] do_raw_spin_lock
0.18% [kernel] [k] __slab_alloc
It seems __btrfs_run_delayed_refs() took most of the cpu time. After some debug
work, I found that select_delayed_ref() is causing this issue: a delayed
head, in our case, will be full of BTRFS_DROP_DELAYED_REF nodes, but
select_delayed_ref() will first try to iterate the node list to find
BTRFS_ADD_DELAYED_REF nodes. Obviously this is a disaster in this case and
wastes much time.
To fix this issue, we introduce a new ref_add_list in struct btrfs_delayed_ref_head;
then in select_delayed_ref(), if this list is not empty, we can directly use
the nodes in this list. With this patch, it took just about 10~15 seconds to
delete the same file. Now using perf top, it reports that:
PerfTop: 2734 irqs/sec kernel:99.5% exact: 0.0% [4000Hz cpu-clock], (all, 4 CPUs)
----------------------------------------------------------------------------------------
20.74% [kernel] [k] _raw_spin_unlock_irqrestore
16.33% [kernel] [k] __slab_alloc
5.41% [kernel] [k] lock_acquired
4.42% [kernel] [k] lock_acquire
4.05% [kernel] [k] lock_release
3.37% [kernel] [k] _raw_spin_unlock_irq
For normal files, this patch also helps: at least we do not need to
iterate the whole list to find BTRFS_ADD_DELAYED_REF nodes.
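The idea of keeping the add refs on their own list can be sketched as follows. This is a simplified illustration rather than the kernel code: `ref_sketch`, `head_sketch`, `queue_ref()`, `select_ref_scan()` and `select_ref_fast()` are hypothetical names, singly linked lists replace the kernel list helpers, and the fallback to the first queued ref when no add ref exists is an assumption made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for BTRFS_ADD_DELAYED_REF and BTRFS_DROP_DELAYED_REF. */
enum ref_action { REF_ADD, REF_DROP };

struct ref_sketch {
	enum ref_action action;
	struct ref_sketch *next;     /* next ref on the head's full list */
	struct ref_sketch *next_add; /* next ref on the head's add-only list */
};

struct head_sketch {
	struct ref_sketch *refs;     /* every queued ref */
	struct ref_sketch *add_refs; /* only REF_ADD refs (the ref_add_list idea) */
};

/* Queue a ref; add refs are also linked on the add-only list. */
static void queue_ref(struct head_sketch *head, struct ref_sketch *ref)
{
	assert(head != NULL && ref != NULL);
	ref->next = head->refs;
	head->refs = ref;
	ref->next_add = NULL;
	if (ref->action == REF_ADD) {
		ref->next_add = head->add_refs;
		head->add_refs = ref;
	}
}

/* Before: walk the whole list looking for an add ref, O(n) per call. */
static struct ref_sketch *select_ref_scan(const struct head_sketch *head)
{
	for (struct ref_sketch *r = head->refs; r != NULL; r = r->next)
		if (r->action == REF_ADD)
			return r;
	return head->refs;
}

/* After: take an add ref directly if one exists, O(1) per call. */
static struct ref_sketch *select_ref_fast(const struct head_sketch *head)
{
	return head->add_refs != NULL ? head->add_refs : head->refs;
}
```

When a head holds only drop refs, `select_ref_scan()` walks every node before giving up, while `select_ref_fast()` sees the empty add list at once; repeated for every ref on a long list, that difference is what the perf output above reflects.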
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-26 18:07:33 +08:00
			if (!list_empty(&ref->add_list))
				list_del(&ref->add_list);
2014-01-23 09:21:38 -05:00
			atomic_dec(&delayed_refs->num_entries);
			btrfs_put_delayed_ref(ref);
btrfs: stop doing excessive space reservation for csum deletion
Currently when reserving space for deleting the csum items for a data
extent, when adding or updating a delayed ref head, we determine how
many leaves of csum items we can have and then pass that number to the
helper btrfs_calc_delayed_ref_bytes(). This helper is used for calculating
space for all tree modifications we need when running delayed references,
however the amount of space it computes is excessive for deleting csum
items because:
1) It uses btrfs_calc_insert_metadata_size() which is excessive because
we only need to delete csum items from the csum tree, we don't need
to insert any items, so btrfs_calc_metadata_size() is all we need (as
it computes space needed to delete an item);
2) If the free space tree is enabled, it doubles the amount of space,
which is pointless for csum deletion since we don't need to touch the
free space tree or any other tree other than the csum tree.
So improve on this by tracking how many csum deletions we have and using
a new helper to calculate space for csum deletions (just a wrapper around
btrfs_calc_metadata_size() with a comment). This reduces the amount of
space we need to reserve for csum deletions by a factor of 4, and it helps
reduce the number of times we have to block space reservations and have
the reclaim task enter the space flushing algorithm (flush delayed items,
flush delayed refs, etc) in order to satisfy tickets.
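The factor of 4 follows from the two points listed above, which can be made concrete with a small sketch. The formulas are assumptions made for illustration: the deletion estimate is taken as one node per tree level per item, the insertion estimate as twice that, and `SKETCH_MAX_LEVEL`, `old_csum_rsv()` and `new_csum_rsv()` are hypothetical names rather than the kernel's helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_LEVEL 8 /* assumed upper bound on tree height */

/* Assumed deletion-only estimate: one node per level for each item. */
static uint64_t calc_metadata_size(uint64_t nodesize, unsigned int num_items)
{
	return nodesize * SKETCH_MAX_LEVEL * num_items;
}

/* Assumed insertion estimate: twice the deletion estimate. */
static uint64_t calc_insert_metadata_size(uint64_t nodesize, unsigned int num_items)
{
	return nodesize * SKETCH_MAX_LEVEL * 2 * num_items;
}

/* Old reservation: insertion estimate, doubled when the free space tree is on. */
static uint64_t old_csum_rsv(uint64_t nodesize, unsigned int leaves,
			     bool free_space_tree)
{
	uint64_t bytes = calc_insert_metadata_size(nodesize, leaves);

	return free_space_tree ? bytes * 2 : bytes;
}

/* New reservation: deletion estimate only, regardless of the free space tree. */
static uint64_t new_csum_rsv(uint64_t nodesize, unsigned int leaves)
{
	assert(nodesize > 0);
	return calc_metadata_size(nodesize, leaves);
}
```

Under these assumptions the old reservation is 4 times the new one with the free space tree enabled and 2 times without it, for any node size and leaf count.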
For example this results in a total time decrease when unlinking (or
truncating) files with many extents, as we end up having to block on space
metadata reservations less often. Example test:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/test
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
# Use compression to quickly create files with a lot of extents
# (each with a size of 128K).
mount -o compress=lzo $DEV $MNT
# 120G gives at least 983040 extents with a size of 128K.
xfs_io -f -c "pwrite -S 0xab -b 1M 0 120G" $MNT/foobar
# Flush all delalloc and clear all metadata from memory.
umount $MNT
mount -o compress=lzo $DEV $MNT
start=$(date +%s%N)
rm -f $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "rm took $dur milliseconds"
umount $MNT
Before this change rm took: 7504 milliseconds
After this change rm took: 6574 milliseconds (-12.4%)
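The extent count and the percentage in this test can be reproduced with a little arithmetic. The helpers below are hypothetical and exist only for this check; the extent count assumes the 120G written by the pwrite command is split into extents of exactly 128K.

```c
#include <assert.h>
#include <stdint.h>

/* Number of fixed-size extents needed to cover a file of the given size. */
static uint64_t extent_count(uint64_t file_bytes, uint64_t extent_bytes)
{
	assert(extent_bytes > 0);
	return (file_bytes + extent_bytes - 1) / extent_bytes;
}

/* Change in elapsed time relative to the baseline, in percent (negative is faster). */
static double elapsed_change_pct(uint64_t before_ms, uint64_t after_ms)
{
	assert(before_ms > 0);
	return ((double)after_ms - (double)before_ms) / (double)before_ms * 100.0;
}
```

`extent_count(120ULL << 30, 128ULL << 10)` gives 983040, matching the number in the script comment, and `elapsed_change_pct(7504, 6574)` is about -12.4, matching the reported result.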
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-08 18:20:37 +01:00
			btrfs_delayed_refs_rsv_release(fs_info, 1, 0);
2013-06-03 16:42:36 -04:00
		}
2014-01-23 09:21:38 -05:00
		if (head->must_insert_reserved)
			pin_bytes = true;
		btrfs_free_delayed_extent_op(head->extent_op);
2018-11-21 14:05:40 -05:00
		btrfs_delete_ref_head(delayed_refs, head);
2014-01-23 09:21:38 -05:00
		spin_unlock(&head->lock);
		spin_unlock(&delayed_refs->lock);
		mutex_unlock(&head->mutex);
2011-01-06 19:30:25 +08:00
2020-01-20 16:09:08 +02:00
		if (pin_bytes) {
			struct btrfs_block_group *cache;
			cache = btrfs_lookup_block_group(fs_info, head->bytenr);
			BUG_ON(!cache);
			spin_lock(&cache->space_info->lock);
			spin_lock(&cache->lock);
			cache->pinned += head->num_bytes;
			btrfs_space_info_update_bytes_pinned(fs_info,
				cache->space_info, head->num_bytes);
			cache->reserved -= head->num_bytes;
			cache->space_info->bytes_reserved -= head->num_bytes;
			spin_unlock(&cache->lock);
			spin_unlock(&cache->space_info->lock);
			btrfs_put_block_group(cache);
			btrfs_error_unpin_extent_range(fs_info, head->bytenr,
				head->bytenr + head->num_bytes - 1);
		}
2018-11-21 14:05:41 -05:00
		btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
2017-09-29 15:43:57 -04:00
		btrfs_put_delayed_ref_head(head);
2011-01-06 19:30:25 +08:00
		cond_resched();
		spin_lock(&delayed_refs->lock);
	}
2020-02-11 15:25:37 +08:00
	btrfs_qgroup_destroy_extent_records(trans);
2011-01-06 19:30:25 +08:00
	spin_unlock(&delayed_refs->lock);
}
2012-03-01 14:56:26 +01:00
static void btrfs_destroy_delalloc_inodes(struct btrfs_root *root)
2011-01-06 19:30:25 +08:00
{
	struct btrfs_inode *btrfs_inode;
2023-08-10 11:00:22 +08:00
	LIST_HEAD(splice);
2011-01-06 19:30:25 +08:00
2013-05-15 07:48:22 +00:00
	spin_lock(&root->delalloc_lock);
	list_splice_init(&root->delalloc_inodes, &splice);
2011-01-06 19:30:25 +08:00
	while (!list_empty(&splice)) {
2018-04-27 12:21:53 +03:00
		struct inode *inode = NULL;
2013-05-15 07:48:22 +00:00
		btrfs_inode = list_first_entry(&splice, struct btrfs_inode,
					       delalloc_inodes);
2024-02-22 09:56:17 +01:00
		btrfs_del_delalloc_inode(btrfs_inode);
2013-05-15 07:48:22 +00:00
		spin_unlock(&root->delalloc_lock);
2011-01-06 19:30:25 +08:00
2018-04-27 12:21:53 +03:00
		/*
		 * Make sure we get a live inode and that it'll not disappear
		 * meanwhile.
		 */
		inode = igrab(&btrfs_inode->vfs_inode);
		if (inode) {
2023-05-11 12:45:59 -04:00
			unsigned int nofs_flag;
			nofs_flag = memalloc_nofs_save();
2018-04-27 12:21:53 +03:00
			invalidate_inode_pages2(inode->i_mapping);
2023-05-11 12:45:59 -04:00
			memalloc_nofs_restore(nofs_flag);
2018-04-27 12:21:53 +03:00
			iput(inode);
		}
2013-05-15 07:48:22 +00:00
		spin_lock(&root->delalloc_lock);
2011-01-06 19:30:25 +08:00
	}
2013-05-15 07:48:22 +00:00
	spin_unlock(&root->delalloc_lock);
}
static void btrfs_destroy_all_delalloc_inodes(struct btrfs_fs_info *fs_info)
{
	struct btrfs_root *root;
2023-08-10 11:00:22 +08:00
	LIST_HEAD(splice);
2013-05-15 07:48:22 +00:00
	spin_lock(&fs_info->delalloc_root_lock);
	list_splice_init(&fs_info->delalloc_roots, &splice);
	while (!list_empty(&splice)) {
		root = list_first_entry(&splice, struct btrfs_root,
					delalloc_root);
2020-01-24 09:33:01 -05:00
		root = btrfs_grab_root(root);
2013-05-15 07:48:22 +00:00
		BUG_ON(!root);
		spin_unlock(&fs_info->delalloc_root_lock);
		btrfs_destroy_delalloc_inodes(root);
2020-01-24 09:33:01 -05:00
		btrfs_put_root(root);
2013-05-15 07:48:22 +00:00
		spin_lock(&fs_info->delalloc_root_lock);
	}
	spin_unlock(&fs_info->delalloc_root_lock);
2011-01-06 19:30:25 +08:00
}
2023-06-30 16:03:47 +01:00
static void btrfs_destroy_marked_extents(struct btrfs_fs_info *fs_info,
					 struct extent_io_tree *dirty_pages,
					 int mark)
2011-01-06 19:30:25 +08:00
{
	struct extent_buffer *eb;
	u64 start = 0;
	u64 end;
2023-06-30 16:03:49 +01:00
	while (find_first_extent_bit(dirty_pages, start, &start, &end,
				     mark, NULL)) {
2016-04-26 23:54:39 +02:00
		clear_extent_bits(dirty_pages, start, end, mark);
2011-01-06 19:30:25 +08:00
		while (start <= end) {
2016-06-22 18:54:23 -04:00
			eb = find_extent_buffer(fs_info, start);
			start += fs_info->nodesize;
2013-04-24 16:41:19 -04:00
			if (!eb)
2011-01-06 19:30:25 +08:00
				continue;
2023-01-26 16:00:56 -05:00
			btrfs_tree_lock(eb);
2013-04-24 16:41:19 -04:00
			wait_on_extent_buffer_writeback(eb);
2023-01-26 16:00:58 -05:00
			btrfs_clear_buffer_dirty(NULL, eb);
2023-01-26 16:00:56 -05:00
			btrfs_tree_unlock(eb);
2011-01-06 19:30:25 +08:00
2013-04-24 16:41:19 -04:00
			free_extent_buffer_stale(eb);
2011-01-06 19:30:25 +08:00
		}
	}
}
2023-06-30 16:03:48 +01:00
static void btrfs_destroy_pinned_extent(struct btrfs_fs_info *fs_info,
					struct extent_io_tree *unpin)
2011-01-06 19:30:25 +08:00
{
	u64 start;
	u64 end;
	while (1) {
2018-11-16 13:04:44 +00:00
		struct extent_state *cached_state = NULL;
btrfs: fix pinned underflow after transaction aborted
When running generic/475, we may get the following warning in dmesg:
[ 6902.102154] WARNING: CPU: 3 PID: 18013 at fs/btrfs/extent-tree.c:9776 btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
[ 6902.109160] CPU: 3 PID: 18013 Comm: umount Tainted: G W O 4.19.0-rc8+ #8
[ 6902.110971] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 6902.112857] RIP: 0010:btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
[ 6902.118921] RSP: 0018:ffffc9000459bdb0 EFLAGS: 00010286
[ 6902.120315] RAX: ffff880175050bb0 RBX: ffff8801124a8000 RCX: 0000000000170007
[ 6902.121969] RDX: 0000000000000002 RSI: 0000000000170007 RDI: ffffffff8125fb74
[ 6902.123716] RBP: ffff880175055d10 R08: 0000000000000000 R09: 0000000000000000
[ 6902.125417] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880175055d88
[ 6902.127129] R13: ffff880175050bb0 R14: 0000000000000000 R15: dead000000000100
[ 6902.129060] FS: 00007f4507223780(0000) GS:ffff88017ba00000(0000) knlGS:0000000000000000
[ 6902.130996] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6902.132558] CR2: 00005623599cac78 CR3: 000000014b700001 CR4: 00000000003606e0
[ 6902.134270] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6902.135981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6902.137836] Call Trace:
[ 6902.138939] close_ctree+0x171/0x330 [btrfs]
[ 6902.140181] ? kthread_stop+0x146/0x1f0
[ 6902.141277] generic_shutdown_super+0x6c/0x100
[ 6902.142517] kill_anon_super+0x14/0x30
[ 6902.143554] btrfs_kill_super+0x13/0x100 [btrfs]
[ 6902.144790] deactivate_locked_super+0x2f/0x70
[ 6902.146014] cleanup_mnt+0x3b/0x70
[ 6902.147020] task_work_run+0x9e/0xd0
[ 6902.148036] do_syscall_64+0x470/0x600
[ 6902.149142] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 6902.150375] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 6902.151640] RIP: 0033:0x7f45077a6a7b
[ 6902.157324] RSP: 002b:00007ffd589f3e68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 6902.159187] RAX: 0000000000000000 RBX: 000055e8eec732b0 RCX: 00007f45077a6a7b
[ 6902.160834] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055e8eec73490
[ 6902.162526] RBP: 0000000000000000 R08: 000055e8eec734b0 R09: 00007ffd589f26c0
[ 6902.164141] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e8eec73490
[ 6902.165815] R13: 00007f4507ac61a4 R14: 0000000000000000 R15: 00007ffd589f40d8
[ 6902.167553] irq event stamp: 0
[ 6902.168998] hardirqs last enabled at (0): [<0000000000000000>] (null)
[ 6902.170731] hardirqs last disabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
[ 6902.172773] softirqs last enabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
[ 6902.174671] softirqs last disabled at (0): [<0000000000000000>] (null)
[ 6902.176407] ---[ end trace 463138c2986b275c ]---
[ 6902.177636] BTRFS info (device dm-3): space_info 4 has 273465344 free, is not full
[ 6902.179453] BTRFS info (device dm-3): space_info total=276824064, used=4685824, pinned=18446744073708158976, reserved=0, may_use=0, readonly=65536
The last line above shows "pinned=18446744073708158976", which is
-1392640 interpreted as an unsigned 64-bit (u64) value, an obvious underflow.
While transaction_kthread is running cleanup_transaction(), another
task (fsstress) may be running btrfs_commit_transaction(). In that case
btrfs_finish_extent_commit() can pick up the same range that
btrfs_destroy_pinned_extent() got, so the range is unpinned twice and
the pinned counter underflows.
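The arithmetic behind that figure can be checked in user space. What follows is a minimal sketch, not the kernel code: `struct space_info_sketch`, `unpin_range()`, `double_unpin()` and `as_signed()` are hypothetical names introduced for illustration. It shows that subtracting the same range twice from an unsigned 64-bit counter wraps around modulo 2^64, and that 18446744073708158976 equals 2^64 - 1392640.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the pinned-bytes accounting of a space_info. */
struct space_info_sketch {
	uint64_t bytes_pinned;
};

/* Subtract an unpinned range from the counter, as an unpin would. */
static void unpin_range(struct space_info_sketch *si, uint64_t len)
{
	si->bytes_pinned -= len; /* unsigned: wraps modulo 2^64 on underflow */
}

/* Unpin the same range twice, as the race allows, and return the counter. */
static uint64_t double_unpin(uint64_t pinned, uint64_t len)
{
	struct space_info_sketch si = { .bytes_pinned = pinned };

	unpin_range(&si, len); /* first path to see the range */
	unpin_range(&si, len); /* second path, same range again */
	return si.bytes_pinned;
}

/*
 * Reinterpret the counter as signed to read the size of the underflow.
 * Assumes a two's complement target, which holds for mainstream compilers.
 */
static int64_t as_signed(uint64_t v)
{
	return (int64_t)v;
}
```

For example, `assert(as_signed(double_unpin(1392640, 1392640)) == -1392640);` holds, which matches the value printed in the space_info dump. Serializing the two paths, as the fix does with a mutex, means the second path no longer finds the range and the subtraction happens once.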
Fixes: d4b450cd4b33 ("Btrfs: fix race between transaction commit and empty block group removal")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-24 20:24:03 +08:00
/*
* The btrfs_finish_extent_commit ( ) may get the same range as
* ours between find_first_extent_bit and clear_extent_dirty .
* Hence , hold the unused_bg_unpin_mutex to avoid unpinning
* the same extent range twice .
*/
mutex_lock ( & fs_info - > unused_bg_unpin_mutex ) ;
2023-06-30 16:03:49 +01:00
if ( ! find_first_extent_bit ( unpin , 0 , & start , & end ,
EXTENT_DIRTY , & cached_state ) ) {
2018-10-24 20:24:03 +08:00
mutex_unlock ( & fs_info - > unused_bg_unpin_mutex ) ;
2011-01-06 19:30:25 +08:00
break ;
2018-10-24 20:24:03 +08:00
}
2011-01-06 19:30:25 +08:00
2018-11-16 13:04:44 +00:00
clear_extent_dirty ( unpin , start , end , & cached_state ) ;
free_extent_state ( cached_state ) ;
2016-06-22 18:54:24 -04:00
btrfs_error_unpin_extent_range ( fs_info , start , end ) ;
2018-10-24 20:24:03 +08:00
mutex_unlock ( & fs_info - > unused_bg_unpin_mutex ) ;
2011-01-06 19:30:25 +08:00
cond_resched ( ) ;
}
}
2019-10-29 19:20:18 +01:00
static void btrfs_cleanup_bg_io ( struct btrfs_block_group * cache )
2016-07-20 17:44:12 -07:00
{
struct inode * inode ;
inode = cache - > io_ctl . inode ;
if ( inode ) {
2023-05-11 12:45:59 -04:00
unsigned int nofs_flag ;
nofs_flag = memalloc_nofs_save ( ) ;
2016-07-20 17:44:12 -07:00
invalidate_inode_pages2 ( inode - > i_mapping ) ;
2023-05-11 12:45:59 -04:00
memalloc_nofs_restore ( nofs_flag ) ;
2016-07-20 17:44:12 -07:00
BTRFS_I ( inode ) - > generation = 0 ;
cache - > io_ctl . inode = NULL ;
iput ( inode ) ;
}
btrfs: fix space cache memory leak after transaction abort
If a transaction aborts it can cause a memory leak of the pages array of
a block group's io_ctl structure. The following steps explain how that can
happen:
1) Transaction N is committing, currently in state TRANS_STATE_UNBLOCKED
and it's about to start writing out dirty extent buffers;
2) Transaction N + 1 already started and another task, task A, just called
btrfs_commit_transaction() on it;
3) Block group B was dirtied (extents allocated from it) by transaction
N + 1, so when task A calls btrfs_start_dirty_block_groups(), at the
very beginning of the transaction commit, it starts writeback for the
block group's space cache by calling btrfs_write_out_cache(), which
allocates the pages array for the block group's io_ctl with a call to
io_ctl_init(). Block group B is added to the io_list of transaction
N + 1 by btrfs_start_dirty_block_groups();
4) While transaction N's commit is writing out the extent buffers, it gets
an IO error and aborts transaction N, also setting the file system to
RO mode;
5) Task A has already returned from btrfs_start_dirty_block_groups(), is at
btrfs_commit_transaction() and has set transaction N + 1 state to
TRANS_STATE_COMMIT_START. Immediately after that it checks that the
filesystem was turned to RO mode, due to transaction N's abort, and
jumps to the "cleanup_transaction" label. After that we end up at
btrfs_cleanup_one_transaction() which calls btrfs_cleanup_dirty_bgs().
That helper finds block group B in the transaction's io_list but it
never releases the pages array of the block group's io_ctl, resulting in
a memory leak.
In fact at the point when we are at btrfs_cleanup_dirty_bgs(), the pages
array points to pages that were already released by us at
__btrfs_write_out_cache() through the call to io_ctl_drop_pages(). We end
up freeing the pages array only after waiting for the ordered extent to
complete through btrfs_wait_cache_io(), which calls io_ctl_free() to do
that. But in the transaction abort case we don't wait for the space cache's
ordered extent to complete through a call to btrfs_wait_cache_io(), so
that's why we end up with a memory leak - we wait for the ordered extent
to complete indirectly by shutting down the work queues and waiting for
any jobs in them to complete before returning from close_ctree().
We can solve the leak simply by freeing the pages array right after
releasing the pages (with the call to io_ctl_drop_pages()) at
__btrfs_write_out_cache(), since we will never use it anymore after that
and the pages array points to already released pages at that point, which
is currently not a problem since no one will use it after that, but not a
good practice anyway since it can easily lead to use-after-free issues.
So fix this by freeing the pages array right after releasing the pages at
__btrfs_write_out_cache().
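The ownership rule behind this fix can be pictured with a small user-space sketch. It is not the btrfs implementation: `struct io_ctl_sketch` and the `io_ctl_sketch_*` and `write_out_sketch()` helpers are hypothetical stand-ins, and plain `malloc`/`free` with an assumed 4096-byte page size replace real page allocation. The point it illustrates is that the array is freed in the same place the pages are dropped, so a later abort path finds nothing left to release.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a block group's io_ctl. */
struct io_ctl_sketch {
	void **pages;   /* array of page pointers */
	int num_pages;
};

/* Allocate the pages array and the pages it points to. */
static int io_ctl_sketch_init(struct io_ctl_sketch *ctl, int num_pages)
{
	ctl->num_pages = 0;
	ctl->pages = calloc((size_t)num_pages, sizeof(*ctl->pages));
	if (!ctl->pages)
		return -1;
	for (int i = 0; i < num_pages; i++) {
		ctl->pages[i] = malloc(4096);
		if (!ctl->pages[i]) {
			while (i-- > 0)
				free(ctl->pages[i]);
			free(ctl->pages);
			ctl->pages = NULL;
			return -1;
		}
	}
	ctl->num_pages = num_pages;
	return 0;
}

/* Release the pages; the array itself still exists afterwards. */
static void io_ctl_sketch_drop_pages(struct io_ctl_sketch *ctl)
{
	for (int i = 0; i < ctl->num_pages; i++) {
		free(ctl->pages[i]);
		ctl->pages[i] = NULL;
	}
}

/* Release the array and clear the pointer so it cannot dangle. */
static void io_ctl_sketch_free(struct io_ctl_sketch *ctl)
{
	free(ctl->pages);
	ctl->pages = NULL;
	ctl->num_pages = 0;
}

/* The write-out path after the fix: drop the pages, then free the array. */
static void write_out_sketch(struct io_ctl_sketch *ctl)
{
	io_ctl_sketch_drop_pages(ctl);
	io_ctl_sketch_free(ctl);
}
```

With this ordering, a cleanup path that runs after an abort can simply check that `ctl->pages` is `NULL`, which is the kind of invariant an assertion in the cleanup code can then enforce.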
This issue can often be reproduced with test case generic/475 from fstests
and kmemleak can detect it and reports it with the following trace:
unreferenced object 0xffff9bbf009fa600 (size 512):
comm "fsstress", pid 38807, jiffies 4298504428 (age 22.028s)
hex dump (first 32 bytes):
00 a0 7c 4d 3d ed ff ff 40 a0 7c 4d 3d ed ff ff ..|M=...@.|M=...
80 a0 7c 4d 3d ed ff ff c0 a0 7c 4d 3d ed ff ff ..|M=.....|M=...
backtrace:
[<00000000f4b5cfe2>] __kmalloc+0x1a8/0x3e0
[<0000000028665e7f>] io_ctl_init+0xa7/0x120 [btrfs]
[<00000000a1f95b2d>] __btrfs_write_out_cache+0x86/0x4a0 [btrfs]
[<00000000207ea1b0>] btrfs_write_out_cache+0x7f/0xf0 [btrfs]
[<00000000af21f534>] btrfs_start_dirty_block_groups+0x27b/0x580 [btrfs]
[<00000000c3c23d44>] btrfs_commit_transaction+0xa6f/0xe70 [btrfs]
[<000000009588930c>] create_subvol+0x581/0x9a0 [btrfs]
[<000000009ef2fd7f>] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
[<00000000474e5187>] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
[<00000000708ee349>] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
[<00000000ea60106f>] btrfs_ioctl+0x12c/0x3130 [btrfs]
[<000000005c923d6d>] __x64_sys_ioctl+0x83/0xb0
[<0000000043ace2c9>] do_syscall_64+0x33/0x80
[<00000000904efbce>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-14 11:04:09 +01:00
ASSERT ( cache - > io_ctl . pages = = NULL ) ;
2016-07-20 17:44:12 -07:00
btrfs_put_block_group ( cache ) ;
}
void btrfs_cleanup_dirty_bgs ( struct btrfs_transaction * cur_trans ,
2016-06-22 18:54:24 -04:00
struct btrfs_fs_info * fs_info )
2016-07-20 17:44:12 -07:00
{
2019-10-29 19:20:18 +01:00
struct btrfs_block_group * cache ;
2016-07-20 17:44:12 -07:00
spin_lock ( & cur_trans - > dirty_bgs_lock ) ;
while ( ! list_empty ( & cur_trans - > dirty_bgs ) ) {
cache = list_first_entry ( & cur_trans - > dirty_bgs ,
2019-10-29 19:20:18 +01:00
struct btrfs_block_group ,
2016-07-20 17:44:12 -07:00
dirty_list ) ;
if ( ! list_empty ( & cache - > io_list ) ) {
spin_unlock ( & cur_trans - > dirty_bgs_lock ) ;
list_del_init ( & cache - > io_list ) ;
btrfs_cleanup_bg_io ( cache ) ;
spin_lock ( & cur_trans - > dirty_bgs_lock ) ;
}
list_del_init ( & cache - > dirty_list ) ;
spin_lock ( & cache - > lock ) ;
cache - > disk_cache_state = BTRFS_DC_ERROR ;
spin_unlock ( & cache - > lock ) ;
spin_unlock ( & cur_trans - > dirty_bgs_lock ) ;
btrfs_put_block_group ( cache ) ;
2023-09-28 11:12:49 +01:00
btrfs_dec_delayed_refs_rsv_bg_updates ( fs_info ) ;
2016-07-20 17:44:12 -07:00
spin_lock ( & cur_trans - > dirty_bgs_lock ) ;
}
spin_unlock ( & cur_trans - > dirty_bgs_lock ) ;
2018-02-08 18:25:18 +02:00
/*
* Refer to the definition of the io_bgs member for details on why it ' s
* safe to use it without any locking .
*/
2016-07-20 17:44:12 -07:00
while ( ! list_empty ( & cur_trans - > io_bgs ) ) {
cache = list_first_entry ( & cur_trans - > io_bgs ,
2019-10-29 19:20:18 +01:00
struct btrfs_block_group ,
2016-07-20 17:44:12 -07:00
io_list ) ;
list_del_init ( & cache - > io_list ) ;
spin_lock ( & cache - > lock ) ;
cache - > disk_cache_state = BTRFS_DC_ERROR ;
spin_unlock ( & cache - > lock ) ;
btrfs_cleanup_bg_io ( cache ) ;
}
}
2023-12-01 13:00:11 -08:00
static void btrfs_free_all_qgroup_pertrans ( struct btrfs_fs_info * fs_info )
{
struct btrfs_root * gang [ 8 ] ;
int i ;
int ret ;
spin_lock ( & fs_info - > fs_roots_radix_lock ) ;
while ( 1 ) {
ret = radix_tree_gang_lookup_tag ( & fs_info - > fs_roots_radix ,
( void * * ) gang , 0 ,
ARRAY_SIZE ( gang ) ,
BTRFS_ROOT_TRANS_TAG ) ;
if ( ret = = 0 )
break ;
for ( i = 0 ; i < ret ; i + + ) {
struct btrfs_root * root = gang [ i ] ;
btrfs_qgroup_free_meta_all_pertrans ( root ) ;
radix_tree_tag_clear ( & fs_info - > fs_roots_radix ,
2024-04-15 16:16:23 -04:00
( unsigned long ) btrfs_root_id ( root ) ,
2023-12-01 13:00:11 -08:00
BTRFS_ROOT_TRANS_TAG ) ;
}
}
spin_unlock ( & fs_info - > fs_roots_radix_lock ) ;
}
2012-03-01 17:24:58 +01:00
void btrfs_cleanup_one_transaction ( struct btrfs_transaction * cur_trans ,
2016-06-22 18:54:24 -04:00
struct btrfs_fs_info * fs_info )
2012-03-01 17:24:58 +01:00
{
2019-03-25 14:31:22 +02:00
struct btrfs_device * dev , * tmp ;
2016-06-22 18:54:24 -04:00
btrfs_cleanup_dirty_bgs ( cur_trans , fs_info ) ;
2016-07-20 17:44:12 -07:00
ASSERT ( list_empty ( & cur_trans - > dirty_bgs ) ) ;
ASSERT ( list_empty ( & cur_trans - > io_bgs ) ) ;
2019-03-25 14:31:22 +02:00
list_for_each_entry_safe ( dev , tmp , & cur_trans - > dev_update_list ,
post_commit_list ) {
list_del_init ( & dev - > post_commit_list ) ;
}
2016-06-22 18:54:24 -04:00
btrfs_destroy_delayed_refs ( cur_trans , fs_info ) ;
2012-03-01 17:24:58 +01:00
Btrfs: make the state of the transaction more readable
We used 3 variables to track the state of the transaction, which was complex
and wasted memory. It was also hard to tell which types of transaction
handles should be blocked in each transaction state, so developers often
made mistakes.
This patch addresses these problems. In this patch, we define 6 states
for the transaction,
enum btrfs_trans_state {
TRANS_STATE_RUNNING = 0,
TRANS_STATE_BLOCKED = 1,
TRANS_STATE_COMMIT_START = 2,
TRANS_STATE_COMMIT_DOING = 3,
TRANS_STATE_UNBLOCKED = 4,
TRANS_STATE_COMPLETED = 5,
TRANS_STATE_MAX = 6,
}
and use just 1 variable to track the state.
To make the blocked handle types for each state clearer,
we introduce an array:
unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
[TRANS_STATE_RUNNING] = 0U,
[TRANS_STATE_BLOCKED] = (__TRANS_USERSPACE |
__TRANS_START),
[TRANS_STATE_COMMIT_START] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH),
[TRANS_STATE_COMMIT_DOING] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN),
[TRANS_STATE_UNBLOCKED] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN |
__TRANS_JOIN_NOLOCK),
[TRANS_STATE_COMPLETED] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN |
__TRANS_JOIN_NOLOCK),
}
which is very intuitive.
In addition, because ->in_commit is removed from the transaction structure,
the lock ->commit_lock that was used to protect it is unnecessary, so remove
->commit_lock.
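How such a table is consulted can be sketched in user space. The enum and the table contents below are copied from this message, but everything else is an assumption made for illustration: the flag names drop the leading double underscore (reserved for the implementation in user-space C), their bit values are arbitrary distinct bits, and `trans_type_blocked()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit values; only their distinctness matters for this sketch. */
#define TRANS_USERSPACE		(1U << 0)
#define TRANS_START		(1U << 1)
#define TRANS_ATTACH		(1U << 2)
#define TRANS_JOIN		(1U << 3)
#define TRANS_JOIN_NOLOCK	(1U << 4)

enum trans_state {
	TRANS_STATE_RUNNING = 0,
	TRANS_STATE_BLOCKED = 1,
	TRANS_STATE_COMMIT_START = 2,
	TRANS_STATE_COMMIT_DOING = 3,
	TRANS_STATE_UNBLOCKED = 4,
	TRANS_STATE_COMPLETED = 5,
	TRANS_STATE_MAX = 6,
};

/* For each state, the set of handle types that must wait. */
static const unsigned int blocked_trans_types[TRANS_STATE_MAX] = {
	[TRANS_STATE_RUNNING]      = 0U,
	[TRANS_STATE_BLOCKED]      = TRANS_USERSPACE | TRANS_START,
	[TRANS_STATE_COMMIT_START] = TRANS_USERSPACE | TRANS_START |
				     TRANS_ATTACH,
	[TRANS_STATE_COMMIT_DOING] = TRANS_USERSPACE | TRANS_START |
				     TRANS_ATTACH | TRANS_JOIN,
	[TRANS_STATE_UNBLOCKED]    = TRANS_USERSPACE | TRANS_START |
				     TRANS_ATTACH | TRANS_JOIN |
				     TRANS_JOIN_NOLOCK,
	[TRANS_STATE_COMPLETED]    = TRANS_USERSPACE | TRANS_START |
				     TRANS_ATTACH | TRANS_JOIN |
				     TRANS_JOIN_NOLOCK,
};

/* Hypothetical helper: is a handle of this type blocked in this state? */
static bool trans_type_blocked(enum trans_state state, unsigned int type)
{
	return (blocked_trans_types[state] & type) != 0;
}
```

A single mask test replaces reasoning about several separate flags: for instance, `trans_type_blocked(TRANS_STATE_COMMIT_DOING, TRANS_JOIN)` is true while `trans_type_blocked(TRANS_STATE_BLOCKED, TRANS_JOIN)` is false, directly mirroring the rows of the table.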
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 03:53:43 +00:00
cur_trans - > state = TRANS_STATE_COMMIT_START ;
2016-06-22 18:54:23 -04:00
wake_up ( & fs_info - > transaction_blocked_wait ) ;
2012-03-01 17:24:58 +01:00
2013-05-17 03:53:43 +00:00
cur_trans - > state = TRANS_STATE_UNBLOCKED ;
2016-06-22 18:54:23 -04:00
wake_up ( & fs_info - > transaction_wait ) ;
2012-03-01 17:24:58 +01:00
2016-06-22 18:54:24 -04:00
btrfs_destroy_marked_extents ( fs_info , & cur_trans - > dirty_pages ,
2012-03-01 17:24:58 +01:00
EXTENT_DIRTY ) ;
2020-01-20 16:09:18 +02:00
btrfs_destroy_pinned_extent ( fs_info , & cur_trans - > pinned_extents ) ;
2012-03-01 17:24:58 +01:00
2013-05-17 03:53:43 +00:00
cur_trans - > state = TRANS_STATE_COMPLETED ;
wake_up ( & cur_trans - > commit_wait ) ;
2012-03-01 17:24:58 +01:00
}
2016-06-22 18:54:24 -04:00
static int btrfs_cleanup_transaction ( struct btrfs_fs_info * fs_info )
2011-01-06 19:30:25 +08:00
{
struct btrfs_transaction * t ;
2016-06-22 18:54:23 -04:00
mutex_lock ( & fs_info - > transaction_kthread_mutex ) ;
2011-01-06 19:30:25 +08:00
2016-06-22 18:54:23 -04:00
spin_lock ( & fs_info - > trans_lock ) ;
while ( ! list_empty ( & fs_info - > trans_list ) ) {
t = list_first_entry ( & fs_info - > trans_list ,
2013-09-30 11:36:38 -04:00
struct btrfs_transaction , list ) ;
2023-08-24 16:59:22 -04:00
if ( t - > state > = TRANS_STATE_COMMIT_PREP ) {
2017-03-03 10:55:11 +02:00
refcount_inc ( & t - > use_count ) ;
2016-06-22 18:54:23 -04:00
spin_unlock ( & fs_info - > trans_lock ) ;
2016-06-22 18:54:24 -04:00
btrfs_wait_for_commit ( fs_info , t - > transid ) ;
2013-09-30 11:36:38 -04:00
btrfs_put_transaction ( t ) ;
2016-06-22 18:54:23 -04:00
spin_lock ( & fs_info - > trans_lock ) ;
2013-09-30 11:36:38 -04:00
continue ;
}
2016-06-22 18:54:23 -04:00
if ( t = = fs_info - > running_transaction ) {
2013-09-30 11:36:38 -04:00
t - > state = TRANS_STATE_COMMIT_DOING ;
2016-06-22 18:54:23 -04:00
spin_unlock ( & fs_info - > trans_lock ) ;
2013-09-30 11:36:38 -04:00
/*
* We wait for 0 num_writers since we don ' t hold a trans
* handle open currently for this transaction .
*/
wait_event ( t - > writer_wait ,
atomic_read ( & t - > num_writers ) = = 0 ) ;
} else {
2016-06-22 18:54:23 -04:00
spin_unlock ( & fs_info - > trans_lock ) ;
2013-09-30 11:36:38 -04:00
}
2016-06-22 18:54:24 -04:00
btrfs_cleanup_one_transaction ( t , fs_info ) ;
Btrfs: make the state of the transaction more readable
We used 3 variables to track the state of the transaction, which was complex
and wasted memory. Besides that, it was hard to understand which types of
transaction handles should be blocked in each transaction state, so
developers often made mistakes.
This patch addresses the above problems. In this patch, we define 6 states
for the transaction,
enum btrfs_trans_state {
TRANS_STATE_RUNNING = 0,
TRANS_STATE_BLOCKED = 1,
TRANS_STATE_COMMIT_START = 2,
TRANS_STATE_COMMIT_DOING = 3,
TRANS_STATE_UNBLOCKED = 4,
TRANS_STATE_COMPLETED = 5,
TRANS_STATE_MAX = 6,
}
and use just 1 variable to track those states.
In order to make the blocked handle types for each state clearer,
we introduce an array:
unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
[TRANS_STATE_RUNNING] = 0U,
[TRANS_STATE_BLOCKED] = (__TRANS_USERSPACE |
__TRANS_START),
[TRANS_STATE_COMMIT_START] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH),
[TRANS_STATE_COMMIT_DOING] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN),
[TRANS_STATE_UNBLOCKED] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN |
__TRANS_JOIN_NOLOCK),
[TRANS_STATE_COMPLETED] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN |
__TRANS_JOIN_NOLOCK),
}
which is very intuitive.
Besides that, because we remove ->in_commit from the transaction structure,
the lock ->commit_lock that was used to protect it is no longer necessary,
so remove ->commit_lock as well.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 03:53:43 +00:00
2016-06-22 18:54:23 -04:00
		spin_lock(&fs_info->trans_lock);
		if (t == fs_info->running_transaction)
			fs_info->running_transaction = NULL;
2011-01-06 19:30:25 +08:00
		list_del_init(&t->list);
2016-06-22 18:54:23 -04:00
		spin_unlock(&fs_info->trans_lock);
2011-01-06 19:30:25 +08:00
2013-09-30 11:36:38 -04:00
		btrfs_put_transaction(t);
2021-11-05 16:45:29 -04:00
		trace_btrfs_transaction_commit(fs_info);
2016-06-22 18:54:23 -04:00
		spin_lock(&fs_info->trans_lock);
2013-09-30 11:36:38 -04:00
	}
2016-06-22 18:54:23 -04:00
	spin_unlock(&fs_info->trans_lock);
	btrfs_destroy_all_ordered_extents(fs_info);
2016-06-22 18:54:23 -04:00
	btrfs_destroy_delayed_inodes(fs_info);
	btrfs_assert_delayed_root_empty(fs_info);
2016-06-22 18:54:23 -04:00
	btrfs_destroy_all_delalloc_inodes(fs_info);
2020-03-24 10:47:52 -04:00
	btrfs_drop_all_logs(fs_info);
2024-03-26 11:17:12 -07:00
	btrfs_free_all_qgroup_pertrans(fs_info);
2016-06-22 18:54:23 -04:00
	mutex_unlock(&fs_info->transaction_kthread_mutex);
2011-01-06 19:30:25 +08:00
	return 0;
}
2020-11-26 15:10:37 +02:00
2020-12-07 17:32:32 +02:00
int btrfs_init_root_free_objectid(struct btrfs_root *root)
2020-11-26 15:10:37 +02:00
{
	struct btrfs_path *path;
	int ret;
	struct extent_buffer *l;
	struct btrfs_key search_key;
	struct btrfs_key found_key;
	int slot;
	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;
	search_key.objectid = BTRFS_LAST_FREE_OBJECTID;
	search_key.type = -1;
	search_key.offset = (u64)-1;
	ret = btrfs_search_slot(NULL, root, &search_key, path, 0, 0);
	if (ret < 0)
		goto error;
2024-01-23 23:34:57 +01:00
	if (ret == 0) {
		/*
		 * Key with offset -1 found, there would have to exist a root
		 * with such id, but this is out of valid range.
		 */
		ret = -EUCLEAN;
		goto error;
	}
2020-11-26 15:10:37 +02:00
	if (path->slots[0] > 0) {
		slot = path->slots[0] - 1;
		l = path->nodes[0];
		btrfs_item_key_to_cpu(l, &found_key, slot);
2020-12-07 17:32:36 +02:00
		root->free_objectid = max_t(u64, found_key.objectid + 1,
					    BTRFS_FIRST_FREE_OBJECTID);
2020-11-26 15:10:37 +02:00
	} else {
2020-12-07 17:32:36 +02:00
		root->free_objectid = BTRFS_FIRST_FREE_OBJECTID;
2020-11-26 15:10:37 +02:00
	}
	ret = 0;
error:
	btrfs_free_path(path);
	return ret;
}
2020-12-07 17:32:33 +02:00
int btrfs_get_free_objectid(struct btrfs_root *root, u64 *objectid)
2020-11-26 15:10:37 +02:00
{
	int ret;
	mutex_lock(&root->objectid_mutex);
2020-12-07 17:32:35 +02:00
	if (unlikely(root->free_objectid >= BTRFS_LAST_FREE_OBJECTID)) {
2020-11-26 15:10:37 +02:00
		btrfs_warn(root->fs_info,
			   "the objectid of root %llu reaches its highest value",
2024-04-15 16:16:23 -04:00
			   btrfs_root_id(root));
2020-11-26 15:10:37 +02:00
		ret = -ENOSPC;
		goto out;
	}
2020-12-07 17:32:36 +02:00
	*objectid = root->free_objectid++;
2020-11-26 15:10:37 +02:00
	ret = 0;
out:
	mutex_unlock(&root->objectid_mutex);
	return ret;
}
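Illustration only, not part of the kernel source: the two functions above seed a per-root counter from the highest objectid found in the tree and then hand out ids under a mutex until an upper bound is reached. A minimal userspace sketch of that pattern follows, assuming a pthread mutex in place of the kernel mutex. The names demo_root, demo_init_free_id, demo_get_free_id and both DEMO_* bounds are hypothetical stand-ins, and the tree search is modelled by two plain parameters.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/* Illustrative bounds only; hypothetical stand-ins for the BTRFS_* limits. */
#define DEMO_FIRST_FREE_ID 256ULL
#define DEMO_LAST_FREE_ID  (UINT64_MAX - 255)

/* Hypothetical stand-in for the two struct btrfs_root fields used above. */
struct demo_root {
	pthread_mutex_t id_mutex;
	uint64_t free_id;
};

/*
 * Seed the counter. have_items and highest_id model the outcome of the
 * tree search: whether any item precedes the search slot and, if so, the
 * objectid of that item.
 */
static void demo_init_free_id(struct demo_root *root, int have_items,
			      uint64_t highest_id)
{
	pthread_mutex_init(&root->id_mutex, NULL);
	/* Same effect as max(highest_id + 1, DEMO_FIRST_FREE_ID). */
	if (have_items && highest_id + 1 > DEMO_FIRST_FREE_ID)
		root->free_id = highest_id + 1;
	else
		root->free_id = DEMO_FIRST_FREE_ID;
}

/* Hand out the next id, or return -ENOSPC once the upper bound is reached. */
static int demo_get_free_id(struct demo_root *root, uint64_t *id)
{
	int ret = 0;

	assert(id != NULL);
	pthread_mutex_lock(&root->id_mutex);
	if (root->free_id >= DEMO_LAST_FREE_ID)
		ret = -ENOSPC;
	else
		*id = root->free_id++;
	pthread_mutex_unlock(&root->id_mutex);
	return ret;
}
```

In this sketch the bound check and the post-increment happen under the same lock, so two concurrent callers cannot receive the same id, and a failed call leaves the counter untouched.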