linux/fs/btrfs
Filipe Manana eb96e22193 btrfs: fix unwritten extent buffer after snapshotting a new subvolume
When creating a snapshot of a subvolume that was created in the current
transaction, we can end up not persisting a dirty extent buffer that is
referenced by the snapshot, resulting in IO errors due to checksum failures
when trying to read the extent buffer later from disk. A sequence of steps
that leads to this is the following:

1) At ioctl.c:create_subvol() we allocate an extent buffer, with logical
   address 36007936, for the leaf/root of a new subvolume that has an ID
   of 291. We mark the extent buffer as dirty, and at this point the
   subvolume tree has a single node/leaf which is also its root (level 0);

2) We no longer commit the transaction used to create the subvolume at
   create_subvol(). We used to, but that was recently removed in
   commit 1b53e51a4a ("btrfs: don't commit transaction for every subvol
   create");

3) The transaction used to create the subvolume has an ID of 33, so the
   extent buffer 36007936 has a generation of 33;

4) Several updates happen to subvolume 291 during transaction 33: several
   files are created and its tree height changes from 0 to 1, so we end up
   with a new root at level 1 and the extent buffer 36007936 is now a leaf
   of that new root node, which is extent buffer 36048896.

   The commit root remains as 36007936, since we are still at transaction
   33;

5) Creation of a snapshot of subvolume 291, with an ID of 292, starts at
   ioctl.c:create_snapshot(). This triggers a commit of transaction 33 and
   we end up at transaction.c:create_pending_snapshot(), in the critical
   section of a transaction commit.

   There we COW the root of subvolume 291, which is extent buffer 36048896.
   The COW operation returns extent buffer 36048896, since there's no need
   to COW because the extent buffer was created in this transaction and it
   was not written yet.

   Then we call btrfs_copy_root() against the root node 36048896. During
   this operation we allocate a new extent buffer to turn into the root
   node of the snapshot, copy the contents of the root node 36048896 into
   this snapshot root extent buffer, set the owner to 292 (the ID of the
   snapshot), etc, and then we call btrfs_inc_ref(). This will create a
   delayed reference for each leaf pointed to by the root node with a
   reference root of 292 - this includes a reference for the leaf
   36007936.

   After that we set the bit BTRFS_ROOT_FORCE_COW in the root's state.

   Then we call btrfs_insert_dir_item(), to create the directory entry in
   the tree of subvolume 291 that points to the snapshot. This ends up
   needing to modify leaf 36007936 to insert the respective directory
   items. Because the bit BTRFS_ROOT_FORCE_COW is set for the root's state,
   we need to COW the leaf. We end up at btrfs_force_cow_block() and then
   at update_ref_for_cow().

   At update_ref_for_cow() we call btrfs_block_can_be_shared() which
   returns false, despite the fact the leaf 36007936 is shared - the
   subvolume's root and the snapshot's root point to that leaf. The
   reason that it incorrectly returns false is because the commit root
   of the subvolume is extent buffer 36007936 - it was the initial root
   of the subvolume when we created it. btrfs_block_can_be_shared() has
   the following logic:

   int btrfs_block_can_be_shared(struct btrfs_root *root,
                                 struct extent_buffer *buf)
   {
       if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
           buf != root->node && buf != root->commit_root &&
           (btrfs_header_generation(buf) <=
            btrfs_root_last_snapshot(&root->root_item) ||
            btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
               return 1;

       return 0;
   }

   It returns false (0) since 'buf' (extent buffer 36007936) matches the
   root's commit root.

   As a result, at update_ref_for_cow(), we don't check for the number
   of references for extent buffer 36007936, we just assume it's not
   shared and therefore that it has only 1 reference, so we set the local
   variable 'refs' to 1.

   Later on, in the final if-else statement at update_ref_for_cow():

   static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
                                          struct btrfs_root *root,
                                          struct extent_buffer *buf,
                                          struct extent_buffer *cow,
                                          int *last_ref)
   {
      (...)
      if (refs > 1) {
          (...)
      } else {
          (...)
          btrfs_clear_buffer_dirty(trans, buf);
          *last_ref = 1;
      }
   }

   So we mark the extent buffer 36007936 as not dirty, and as a result
   we don't write it to disk later in the transaction commit, despite the
   fact that the snapshot's root points to it.

   Attempting to access the leaf or dumping the tree for example shows
   that the extent buffer was not written:

   $ btrfs inspect-internal dump-tree -t 292 /dev/sdb
   btrfs-progs v6.2.2
   file tree key (292 ROOT_ITEM 33)
   node 36110336 level 1 items 2 free space 119 generation 33 owner 292
   node 36110336 flags 0x1(WRITTEN) backref revision 1
   checksum stored a8103e3e
   checksum calced a8103e3e
   fs uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
   chunk uuid e8c9c885-78f4-4d31-85fe-89e5f5fd4a07
           key (256 INODE_ITEM 0) block 36007936 gen 33
           key (257 EXTENT_DATA 0) block 36052992 gen 33
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   total bytes 107374182400
   bytes used 38572032
   uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79

   The respective on disk region is full of zeroes as the device was
   trimmed at mkfs time.

   Obviously 'btrfs check' also detects and complains about this:

   $ btrfs check /dev/sdb
   Opening filesystem to check...
   Checking filesystem on /dev/sdb
   UUID: 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
   generation: 33 (33)
   [1/7] checking root items
   [2/7] checking extents
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   bad tree block 36007936, bytenr mismatch, want=36007936, have=0
   owner ref check failed [36007936 4096]
   ERROR: errors found in extent allocation tree or chunk allocation
   [3/7] checking free space tree
   [4/7] checking fs roots
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   bad tree block 36007936, bytenr mismatch, want=36007936, have=0
   The following tree block(s) is corrupted in tree 292:
        tree block bytenr: 36110336, level: 1, node key: (256, 1, 0)
   root 292 root dir 256 not found
   ERROR: errors found in fs roots
   found 38572032 bytes used, error(s) found
   total csum bytes: 16048
   total tree bytes: 1265664
   total fs tree bytes: 1118208
   total extent tree bytes: 65536
   btree space waste bytes: 562598
   file data blocks allocated: 65978368
    referenced 36569088

Fix this by updating btrfs_block_can_be_shared() to consider that an
extent buffer may be shared if it matches the commit root and if its
generation matches the current transaction's generation.
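
The difference between the old check and the fix described above can be
modelled in userspace. This is only a sketch, not the kernel patch: the
struct and function names (model_eb, model_root, can_be_shared_old,
can_be_shared_fixed, leaf_seen_as_shared) are hypothetical, the fixed
variant is one possible reading of the sentence above, and the model
assumes that the subvolume's last_snapshot has already been set to the
current transaction ID (33) by the time the directory item is inserted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for struct extent_buffer and struct btrfs_root. */
struct model_eb {
    uint64_t bytenr;
    uint64_t generation;
    bool reloc;                 /* models BTRFS_HEADER_FLAG_RELOC */
};

struct model_root {
    bool shareable;             /* models BTRFS_ROOT_SHAREABLE */
    const struct model_eb *node;
    const struct model_eb *commit_root;
    uint64_t last_snapshot;
};

/* Same condition as the btrfs_block_can_be_shared() quoted in step 5. */
static bool can_be_shared_old(const struct model_root *root,
                              const struct model_eb *buf)
{
    return root->shareable &&
           buf != root->node && buf != root->commit_root &&
           (buf->generation <= root->last_snapshot || buf->reloc);
}

/*
 * Sketch of the fix: a buffer that matches the commit root may still be
 * shared if it was created in the current transaction.
 */
static bool can_be_shared_fixed(uint64_t transid,
                                const struct model_root *root,
                                const struct model_eb *buf)
{
    if (!root->shareable || buf == root->node)
        return false;
    if (buf->generation > root->last_snapshot && !buf->reloc)
        return false;
    if (buf != root->commit_root)
        return true;
    return buf->generation == transid;
}

/* Hypothetical helper: the state of subvolume 291 when leaf 36007936 is COWed. */
static bool leaf_seen_as_shared(bool use_fix)
{
    const uint64_t transid = 33;
    const struct model_eb leaf = {
        .bytenr = 36007936, .generation = 33, .reloc = false,
    };
    const struct model_eb new_root = {
        .bytenr = 36048896, .generation = 33, .reloc = false,
    };
    const struct model_root subvol = {
        .shareable = true,
        .node = &new_root,
        .commit_root = &leaf,   /* still the initial root, as in step 4 */
        .last_snapshot = 33,    /* assumption, not stated in the steps above */
    };

    return use_fix ? can_be_shared_fixed(transid, &subvol, &leaf)
                   : can_be_shared_old(&subvol, &leaf);
}
```

In this model leaf_seen_as_shared(false) is false, which is the case that
makes update_ref_for_cow() assume a single reference, while
leaf_seen_as_shared(true) is true.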

This can be reproduced with the following script:

   $ cat test.sh
   #!/bin/bash

   MNT=/mnt/sdi
   DEV=/dev/sdi

   # Use a filesystem with a 64K node size so that we have the same node
   # size on every machine regardless of its page size (on x86_64 default
   # node size is 16K due to the 4K page size, while on PPC it's 64K by
   # default). This way we can make sure we are able to create a btree for
   # the subvolume with a height of 2.
   mkfs.btrfs -f -n 64K $DEV
   mount $DEV $MNT

   btrfs subvolume create $MNT/subvol

   # Create a few empty files on the subvolume, this bumps its btree
   # height to 2 (root node at level 1 and 2 leaves).
   for ((i = 1; i <= 300; i++)); do
       echo -n > $MNT/subvol/file_$i
   done

   btrfs subvolume snapshot -r $MNT/subvol $MNT/subvol/snap

   umount $DEV

   btrfs check $DEV

Running it on a 6.5 kernel (or any 6.6-rc kernel at the moment):

   $ ./test.sh
   Create subvolume '/mnt/sdi/subvol'
   Create a readonly snapshot of '/mnt/sdi/subvol' in '/mnt/sdi/subvol/snap'
   Opening filesystem to check...
   Checking filesystem on /dev/sdi
   UUID: bbdde2ff-7d02-45ca-8a73-3c36f23755a1
   [1/7] checking root items
   [2/7] checking extents
   parent transid verify failed on 30539776 wanted 7 found 5
   parent transid verify failed on 30539776 wanted 7 found 5
   parent transid verify failed on 30539776 wanted 7 found 5
   Ignoring transid failure
   owner ref check failed [30539776 65536]
   ERROR: errors found in extent allocation tree or chunk allocation
   [3/7] checking free space tree
   [4/7] checking fs roots
   parent transid verify failed on 30539776 wanted 7 found 5
   Ignoring transid failure
   Wrong key of child node/leaf, wanted: (256, 1, 0), have: (2, 132, 0)
   Wrong generation of child node/leaf, wanted: 5, have: 7
   root 257 root dir 256 not found
   ERROR: errors found in fs roots
   found 917504 bytes used, error(s) found
   total csum bytes: 0
   total tree bytes: 851968
   total fs tree bytes: 393216
   total extent tree bytes: 65536
   btree space waste bytes: 736550
   file data blocks allocated: 0
    referenced 0
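
The claim in the script's comments that 300 empty files are enough to push
the subvolume's btree to two leaves with a 64K node size can be checked
with a rough estimate. The sketch below assumes the usual on-disk sizes
(101 byte header, 25 byte item header, 160 byte inode item, 10 byte inode
ref, 30 byte dir item) and counts only the four items each new file adds
(INODE_ITEM, INODE_REF, DIR_ITEM, DIR_INDEX). It ignores the directory's
own items and the fact that leaf splits do not pack leaves fully, so it is
a lower bound, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumed on-disk structure sizes, in bytes. */
enum {
    HEADER_SIZE     = 101,  /* tree block header */
    ITEM_SIZE       = 25,   /* per-item header in a leaf */
    INODE_ITEM_SIZE = 160,
    INODE_REF_SIZE  = 10,   /* plus the file name */
    DIR_ITEM_SIZE   = 30    /* plus the file name */
};

/* Leaf bytes consumed by one empty file with a name of name_len bytes. */
static size_t bytes_per_file(size_t name_len)
{
    size_t inode_item = ITEM_SIZE + INODE_ITEM_SIZE;
    size_t inode_ref  = ITEM_SIZE + INODE_REF_SIZE + name_len;
    size_t dir_entry  = ITEM_SIZE + DIR_ITEM_SIZE + name_len;

    /* One DIR_ITEM and one DIR_INDEX per file. */
    return inode_item + inode_ref + 2 * dir_entry;
}

/* Total for files named file_1 .. file_<nr_files>, as in the script. */
static size_t estimate_items_bytes(int nr_files)
{
    size_t total = 0;

    for (int i = 1; i <= nr_files; i++) {
        /* snprintf() with a NULL buffer only computes the name length. */
        int name_len = snprintf(NULL, 0, "file_%d", i);

        total += bytes_per_file((size_t)name_len);
    }
    return total;
}

/* Minimum number of leaves needed to hold total bytes of items. */
static size_t min_leaves(size_t total, size_t nodesize)
{
    size_t capacity = nodesize - HEADER_SIZE;

    return (total + capacity - 1) / capacity;
}
```

Under these assumptions estimate_items_bytes(300) comes to 105876 bytes,
more than the 65435 bytes a single 64K leaf can hold, so at least two
leaves and therefore a level 1 root are needed.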

A test case for fstests will follow soon.

Fixes: 1b53e51a4a ("btrfs: don't commit transaction for every subvol create")
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-23 17:17:30 +02:00
tests btrfs: tests: test invalid splitting when skipping pinned drop extent_map 2023-08-21 14:54:49 +02:00
accessors.c btrfs: add eb to btrfs_node_key_ptr_offset 2022-12-05 18:00:58 +01:00
accessors.h btrfs: use helper sizeof_field in struct accessors 2023-08-21 14:52:13 +02:00
acl.c fs: port acl to mnt_idmap 2023-01-19 09:24:28 +01:00
acl.h fs: port ->set_acl() to pass mnt_idmap 2023-01-19 09:24:27 +01:00
async-thread.c btrfs: use alloc_ordered_workqueue() to create ordered workqueues 2023-06-19 13:59:30 +02:00
async-thread.h btrfs: use alloc_ordered_workqueue() to create ordered workqueues 2023-06-19 13:59:30 +02:00
backref.c btrfs: fix unwritten extent buffer after snapshotting a new subvolume 2023-10-23 17:17:30 +02:00
backref.h btrfs: fix unwritten extent buffer after snapshotting a new subvolume 2023-10-23 17:17:30 +02:00
bio.c btrfs: add an ordered_extent pointer to struct btrfs_bio 2023-06-19 13:59:36 +02:00
bio.h btrfs: add an ordered_extent pointer to struct btrfs_bio 2023-06-19 13:59:36 +02:00
block-group.c btrfs: fix race between finishing block group creation and its item update 2023-09-08 14:10:36 +02:00
block-group.h btrfs: rename add_new_free_space() to btrfs_add_new_free_space() 2023-08-21 14:52:12 +02:00
block-rsv.c btrfs: account block group tree when calculating global reserve size 2023-07-20 19:22:54 +02:00
block-rsv.h btrfs: move btrfs_check_trunc_cache_free_space into block-rsv.c 2023-06-19 13:59:24 +02:00
btrfs_inode.h btrfs: reduce the number of arguments to btrfs_run_delalloc_range 2023-08-21 14:52:14 +02:00
check-integrity.c btrfs: rename __btrfs_map_block to btrfs_map_block 2023-06-19 13:59:34 +02:00
check-integrity.h
compression.c btrfs: make btrfs_compressed_bioset static 2023-06-19 17:01:44 +02:00
compression.h btrfs: pass an ordered_extent to btrfs_submit_compressed_write 2023-06-19 13:59:36 +02:00
ctree.c btrfs: fix unwritten extent buffer after snapshotting a new subvolume 2023-10-23 17:17:30 +02:00
ctree.h btrfs: fix unwritten extent buffer after snapshotting a new subvolume 2023-10-23 17:17:30 +02:00
defrag.c btrfs: drop gfp from parameter extent state helpers 2023-06-19 13:59:30 +02:00
defrag.h btrfs: move defrag related prototypes to their own header 2022-12-05 18:00:46 +01:00
delalloc-space.c btrfs: count extents before taking inode's spinlock when reserving metadata 2023-04-17 18:01:19 +02:00
delalloc-space.h btrfs: move delalloc space related prototypes to delalloc-space.h 2022-12-05 18:00:44 +01:00
delayed-inode.c btrfs: add __counted_by for struct btrfs_delayed_item and use struct_size() 2023-10-11 11:37:19 +02:00
delayed-inode.h btrfs: add __counted_by for struct btrfs_delayed_item and use struct_size() 2023-10-11 11:37:19 +02:00
delayed-ref.c btrfs: prevent transaction block reserve underflow when starting transaction 2023-09-20 20:42:18 +02:00
delayed-ref.h btrfs: prevent transaction block reserve underflow when starting transaction 2023-09-20 20:42:18 +02:00
dev-replace.c btrfs: make find_first_extent_bit() return a boolean 2023-08-21 14:52:12 +02:00
dev-replace.h btrfs: move dev-replace prototypes into dev-replace.h 2022-12-05 18:00:47 +01:00
dir-item.c btrfs: move dir-item prototypes into dir-item.h 2022-12-05 18:00:46 +01:00
dir-item.h btrfs: move dir-item prototypes into dir-item.h 2022-12-05 18:00:46 +01:00
discard.c btrfs: unexport btrfs_run_discard_work and make it static 2023-06-19 13:59:25 +02:00
discard.h btrfs: unexport btrfs_run_discard_work and make it static 2023-06-19 13:59:25 +02:00
disk-io.c btrfs: fix a compilation error if DEBUG is defined in btree_dirty_folio 2023-09-08 14:11:04 +02:00
disk-io.h btrfs: make btrfs_cleanup_fs_roots() static 2023-08-21 14:52:18 +02:00
export.c btrfs: move super_block specific helpers into super.h 2022-12-05 18:00:47 +01:00
export.h btrfs: simplify generation check in btrfs_get_dentry 2022-12-05 18:00:41 +01:00
extent_io.c btrfs: reset destination buffer when read_extent_buffer() gets invalid range 2023-09-20 20:44:57 +02:00
extent_io.h btrfs: zoned: introduce block group context to btrfs_eb_write_context 2023-08-21 14:52:19 +02:00
extent_map.c btrfs: fix incorrect splitting in btrfs_drop_extent_map_range 2023-08-18 14:38:10 +02:00
extent_map.h btrfs: pass the new logical address to split_extent_map 2023-06-19 13:59:33 +02:00
extent-io-tree.c btrfs: make find_first_extent_bit() return a boolean 2023-08-21 14:52:12 +02:00
extent-io-tree.h btrfs: make find_first_extent_bit() return a boolean 2023-08-21 14:52:12 +02:00
extent-tree.c btrfs: log message if extent item not found when running delayed extent op 2023-09-20 20:42:58 +02:00
extent-tree.h btrfs: wait on uncached block groups on every allocation loop 2023-08-21 14:54:47 +02:00
file-item.c btrfs: scrub: avoid unnecessary csum tree search preparing stripes 2023-08-21 14:54:48 +02:00
file-item.h btrfs: scrub: avoid unnecessary csum tree search preparing stripes 2023-08-21 14:54:48 +02:00
file.c btrfs: file_remove_privs needs an exclusive lock in direct io write 2023-09-13 18:41:03 +02:00
file.h btrfs: use cached state when looking for delalloc ranges with fiemap 2022-12-05 18:00:56 +01:00
free-space-cache.c btrfs: zoned: no longer count fresh BG region as zone unusable 2023-08-21 14:52:19 +02:00
free-space-cache.h btrfs: move btrfs_check_trunc_cache_free_space into block-rsv.c 2023-06-19 13:59:24 +02:00
free-space-tree.c btrfs: rename add_new_free_space() to btrfs_add_new_free_space() 2023-08-21 14:52:12 +02:00
free-space-tree.h btrfs: make clear_cache mount option to rebuild FST without disabling it 2023-05-10 14:51:27 +02:00
fs.c btrfs: sysfs: update fs features directory asynchronously 2023-02-13 17:50:35 +01:00
fs.h btrfs: zoned: activate metadata block group on write time 2023-08-21 14:52:19 +02:00
inode-item.c btrfs: remove obsolete delayed ref throttling logic when truncating items 2023-04-17 18:01:19 +02:00
inode-item.h btrfs: move split_flags/combine_flags helpers to inode-item.h 2023-06-19 13:59:25 +02:00
inode.c btrfs: fix race between reading a directory and adding entries to it 2023-09-14 23:24:42 +02:00
ioctl.c btrfs: fix some -Wmaybe-uninitialized warnings in ioctl.c 2023-10-04 01:03:05 +02:00
ioctl.h fs: port ->fileattr_set() to pass mnt_idmap 2023-01-19 09:24:27 +01:00
Kconfig MAINTAINERS: remove links to obsolete btrfs.wiki.kernel.org 2023-09-08 14:21:27 +02:00
locking.c btrfs: add block-group tree to lockdep classes 2023-06-19 13:59:35 +02:00
locking.h btrfs: do not block starts waiting on previous transaction commit 2023-09-08 14:10:49 +02:00
lru_cache.c btrfs: send: cache utimes operations for directories if possible 2023-02-15 19:38:50 +01:00
lru_cache.h btrfs: remove btrfs_lru_cache_is_full() inline function 2023-04-17 18:01:18 +02:00
lzo.c btrfs: disable allocation warnings for compression workspaces 2023-06-19 13:59:34 +02:00
Makefile btrfs: send: genericize the backref cache to allow it to be reused 2023-02-13 17:50:35 +01:00
messages.c btrfs: remove v0 extent handling 2023-08-21 14:54:48 +02:00
messages.h btrfs: remove v0 extent handling 2023-08-21 14:54:48 +02:00
misc.h btrfs: export bitmap_test_range_all_{set,zero} 2023-06-19 13:59:22 +02:00
ordered-data.c btrfs: check for BTRFS_FS_ERROR in pending ordered assert 2023-09-08 14:10:59 +02:00
ordered-data.h btrfs: add a btrfs_finish_ordered_extent helper 2023-06-19 13:59:37 +02:00
orphan.c btrfs: move orphan prototypes into orphan.h 2022-12-05 18:00:47 +01:00
orphan.h btrfs: move orphan prototypes into orphan.h 2022-12-05 18:00:47 +01:00
print-tree.c btrfs: remove v0 extent handling 2023-08-21 14:54:48 +02:00
print-tree.h btrfs: print-tree: pass const extent buffer pointer 2023-06-19 13:59:22 +02:00
props.c btrfs: move super_block specific helpers into super.h 2022-12-05 18:00:47 +01:00
props.h btrfs: make module init/exit match their sequence 2022-12-05 18:00:40 +01:00
qgroup.c btrfs: avoid start and commit empty transaction when flushing qgroups 2023-08-21 14:52:18 +02:00
qgroup.h btrfs: sink gfp_t parameter to btrfs_qgroup_trace_extent 2022-12-05 18:00:43 +01:00
raid56.c btrfs: scrub: avoid unnecessary csum tree search preparing stripes 2023-08-21 14:54:48 +02:00
raid56.h btrfs: raid56: remove unused BTRFS_RBIO_REBUILD_MISSING 2023-08-21 14:52:12 +02:00
rcu-string.h btrfs: replace strncpy() with strscpy() 2022-12-05 18:00:59 +01:00
ref-verify.c btrfs: move accessor helpers into accessors.h 2022-12-05 18:00:42 +01:00
ref-verify.h
reflink.c btrfs: pass btrfs_inode to btrfs_inode_unlock 2022-12-05 18:00:53 +01:00
reflink.h
relocation.c btrfs: fix unwritten extent buffer after snapshotting a new subvolume 2023-10-23 17:17:30 +02:00
relocation.h btrfs: pass an ordered_extent to btrfs_reloc_clone_csums 2023-06-19 13:59:36 +02:00
root-tree.c btrfs: move orphan prototypes into orphan.h 2022-12-05 18:00:47 +01:00
root-tree.h btrfs: move root tree prototypes to their own header 2022-12-05 18:00:44 +01:00
scrub.c btrfs: scrub: move write back of repaired sectors to scrub_stripe_read_repair_worker() 2023-08-21 14:54:49 +02:00
scrub.h btrfs: scrub: remove scrub_bio structure 2023-04-17 18:01:24 +02:00
send.c btrfs: use LIST_HEAD() to initialize the list_head 2023-08-21 14:54:46 +02:00
send.h btrfs: send add define for v2 buffer size 2022-12-05 18:00:41 +01:00
space-info.c btrfs: zoned: re-enable metadata over-commit for zoned mode 2023-08-21 14:52:19 +02:00
space-info.h btrfs: update documentation for BTRFS_RESERVE_FLUSH_EVICT flush method 2023-04-17 18:01:18 +02:00
subpage.c btrfs: stop setting PageError in the data I/O path 2023-06-19 13:59:35 +02:00
subpage.h btrfs: stop setting PageError in the data I/O path 2023-06-19 13:59:35 +02:00
super.c Revert "btrfs: reject unknown mount options early" 2023-10-10 15:27:56 +02:00
super.h btrfs: move super_block specific helpers into super.h 2022-12-05 18:00:47 +01:00
sysfs.c btrfs: sysfs: show if ACL support has been compiled in 2023-08-21 14:52:12 +02:00
sysfs.h btrfs: sysfs: update fs features directory asynchronously 2023-02-13 17:50:35 +01:00
transaction.c btrfs: prevent transaction block reserve underflow when starting transaction 2023-09-20 20:42:18 +02:00
transaction.h btrfs: always print transaction aborted messages with an error level 2023-10-04 01:03:59 +02:00
tree-checker.c for-6.5-rc5-tag 2023-08-12 13:28:55 -07:00
tree-checker.h btrfs: move btrfs_verify_level_key into tree-checker.c 2023-06-19 13:59:25 +02:00
tree-log.c btrfs: initialize start_slot in btrfs_log_prealloc_extents 2023-09-21 18:52:23 +02:00
tree-log.h btrfs: change for_rename argument of btrfs_record_unlink_dir() to bool 2023-06-19 13:59:26 +02:00
tree-mod-log.c btrfs: avoid tree mod log ENOMEM failures when we don't need to log 2023-06-19 13:59:38 +02:00
tree-mod-log.h btrfs: fix SPDX comment in tree-mod-log.h 2022-12-05 18:00:48 +01:00
ulist.c btrfs: constify ulist parameter of ulist_next() 2022-12-05 18:00:50 +01:00
ulist.h btrfs: constify ulist parameter of ulist_next() 2022-12-05 18:00:50 +01:00
uuid-tree.c btrfs: move uuid tree prototypes to uuid-tree.h 2022-12-05 18:00:46 +01:00
uuid-tree.h btrfs: move uuid tree prototypes to uuid-tree.h 2022-12-05 18:00:46 +01:00
verity.c btrfs: convert btrfs_read_merkle_tree_page() to use a folio 2023-09-13 18:40:54 +02:00
verity.h btrfs: move verity prototypes into verity.h 2022-12-05 18:00:47 +01:00
volumes.c btrfs: fix stripe length calculation for non-zoned data chunk allocation 2023-10-15 19:00:59 +02:00
volumes.h btrfs: add a helper to read the superblock metadata_uuid 2023-08-21 14:54:48 +02:00
xattr.c fs: drop unused posix acl handlers 2023-03-06 09:57:12 +01:00
xattr.h
zlib.c btrfs: disable allocation warnings for compression workspaces 2023-06-19 13:59:34 +02:00
zoned.c btrfs: zoned: skip splitting and logical rewriting on pre-alloc write 2023-08-22 14:19:59 +02:00
zoned.h btrfs: zoned: reserve zones for an active metadata/system block group 2023-08-21 14:52:19 +02:00
zstd.c btrfs: disable allocation warnings for compression workspaces 2023-06-19 13:59:34 +02:00