linux/fs
Omar Sandoval 9d274c19a7 btrfs: fix crash on racing fsync and size-extending write into prealloc
We have been seeing crashes on duplicate keys in
btrfs_set_item_key_safe():

  BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.c:2620!
  invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
  RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]

With the following stack trace:

  #0  btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
  #1  btrfs_drop_extents (fs/btrfs/file.c:411:4)
  #2  log_one_extent (fs/btrfs/tree-log.c:4732:9)
  #3  btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
  #4  btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
  #5  btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
  #6  btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
  #7  btrfs_sync_file (fs/btrfs/file.c:1933:8)
  #8  vfs_fsync_range (fs/sync.c:188:9)
  #9  vfs_fsync (fs/sync.c:202:9)
  #10 do_fsync (fs/sync.c:212:9)
  #11 __do_sys_fdatasync (fs/sync.c:225:9)
  #12 __se_sys_fdatasync (fs/sync.c:223:1)
  #13 __x64_sys_fdatasync (fs/sync.c:223:1)
  #14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
  #15 do_syscall_64 (arch/x86/entry/common.c:83:7)
  #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)

So we're logging a changed extent from fsync, which is splitting an
extent in the log tree. But this split part already exists in the tree,
triggering the BUG().

This is the state of the log tree at the time of the crash, dumped with
drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
to get more details than btrfs_print_leaf() gives us:

  >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
  leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
  leaf 33439744 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
          item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
                  generation 7 transid 9 size 8192 nbytes 8473563889606862198
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 204 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417704.983333333 (2024-05-22 15:41:44)
                  mtime 1716417704.983333333 (2024-05-22 15:41:44)
                  otime 17592186044416.000000000 (559444-03-08 01:40:16)
          item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
                  index 195 namelen 3 name: 193
          item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 4096 ram 12288
                  extent compression 0 (none)
          item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 4096 nr 8192
          item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096
  ...

So the real problem happened earlier: notice that items 4 (4k-12k) and 5
(8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
item 5 starts at i_size.

Here is the state of the filesystem tree at the time of the crash:

  >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
  >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
  >>> print_extent_buffer(nodes[0])
  leaf 30425088 level 0 items 184 generation 9 owner 5
  leaf 30425088 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
  	...
          item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
                  generation 7 transid 7 size 4096 nbytes 12288
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 6 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417703.220000000 (2024-05-22 15:41:43)
                  mtime 1716417703.220000000 (2024-05-22 15:41:43)
                  otime 1716417703.220000000 (2024-05-22 15:41:43)
          item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
                  index 195 namelen 3 name: 193
          item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 8192 ram 12288
                  extent compression 0 (none)
          item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096

Item 5 in the log tree corresponds to item 183 in the filesystem tree,
but nothing matches item 4. Furthermore, item 183 is the last item in
the leaf.

btrfs_log_prealloc_extents() is responsible for logging prealloc extents
beyond i_size. It first truncates any previously logged prealloc extents
that start beyond i_size. Then, it walks the filesystem tree and copies
the prealloc extent items to the log tree.

If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
unlocks the tree and does another search. However, while the filesystem
tree is unlocked, an ordered extent completion may modify the tree. In
particular, it may insert an extent item that overlaps with an extent
item that was already copied to the log tree.

This may manifest in several ways depending on the exact scenario,
including an EEXIST error that is silently translated to a full sync,
overlapping items in the log tree, or this crash. This particular crash
is triggered by the following sequence of events:

- Initially, the file has i_size=4k, a regular extent from 0-4k, and a
  prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
  the last item in its B-tree leaf.
- The file is fsync'd, which copies its inode item and both extent items
  to the log tree.
- An xattr is set on the file, which sets the
  BTRFS_INODE_COPY_EVERYTHING flag.
- The range 4k-8k in the file is written using direct I/O. i_size is
  extended to 8k, but the ordered extent is still in flight.
- The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
  calls copy_inode_items_to_log(), which calls
  btrfs_log_prealloc_extents().
- btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
  filesystem tree. Since it starts before i_size, it skips it. Since it
  is the last item in its B-tree leaf, it calls btrfs_next_leaf().
- btrfs_next_leaf() unlocks the path.
- The ordered extent completion runs, which converts the 4k-8k part of
  the prealloc extent to written and inserts the remaining prealloc part
  from 8k-12k.
- btrfs_next_leaf() does a search and finds the new prealloc extent
  8k-12k.
- btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
  the log tree. Note that it overlaps with the 4k-12k prealloc extent
  that was copied to the log tree by the first fsync.
- fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
  extent that was written.
- This tries to drop the range 4k-8k in the log tree, which requires
  adjusting the start of the 4k-12k prealloc extent in the log tree to
  8k.
- btrfs_set_item_key_safe() sees that there is already an extent
  starting at 8k in the log tree and calls BUG().

Fix this by detecting when we're about to insert an overlapping file
extent item in the log tree and truncating the part that would overlap.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-06-05 18:06:30 +02:00
..
9p fs/9p: mitigate inode collisions 2024-04-22 15:34:27 +00:00
adfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
affs affs: remove SLAB_MEM_SPREAD flag usage 2024-02-26 11:36:28 +01:00
afs afs: Fix occasional rmdir-then-VNOVNODE with generic/011 2024-03-14 12:13:21 +01:00
autofs dcache stuff for this cycle 2024-01-11 20:11:35 -08:00
bcachefs bcachefs: fix integer conversion bug 2024-04-28 21:34:29 -04:00
befs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
bfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
btrfs btrfs: fix crash on racing fsync and size-extending write into prealloc 2024-06-05 18:06:30 +02:00
cachefiles cachefiles: fix memory leak in cachefiles_add_cache() 2024-02-20 09:46:07 +01:00
ceph ceph: switch to use cap_delay_lock for the unlink delay list 2024-04-11 22:56:28 +02:00
coda mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
configfs
cramfs fs,block: yield devices early 2024-03-27 13:17:15 +01:00
crypto fscrypt updates for 6.9 2024-03-12 13:17:36 -07:00
debugfs debugfs: fix wait/cancellation handling during remove 2024-03-07 22:08:15 +00:00
devpts fs: Remove the now superfluous sentinel elements from ctl_table array 2023-12-28 04:57:57 -08:00
dlm dlm for 6.9 2024-03-18 15:39:48 -07:00
ecryptfs Merge tag 'exportfs-6.9' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/cel/linux 2024-01-23 17:56:30 +01:00
efivarfs efivarfs: Drop 'duplicates' bool parameter on efivar_init() 2024-02-25 09:43:39 +01:00
efs efs: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:33 +01:00
erofs erofs: reliably distinguish block based and fscache mode 2024-04-28 20:36:52 +08:00
exfat Description for this pull request: 2024-03-21 09:47:12 -07:00
exportfs fs: Create a generic is_dot_dotdot() utility 2024-01-23 10:58:56 -05:00
ext2 \n 2024-03-13 14:30:58 -07:00
ext4 fs,block: yield devices early 2024-03-27 13:17:15 +01:00
f2fs fs,block: yield devices early 2024-03-27 13:17:15 +01:00
fat - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min 2024-03-14 18:03:09 -07:00
freevxfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
fuse cuse: add kernel-doc comments to cuse_process_init_reply() 2024-04-15 11:02:10 +02:00
gfs2 gfs2 fix 2024-03-25 10:53:39 -07:00
hfs hfs: really remove hfs_writepage 2023-12-29 11:58:34 -08:00
hfsplus vfs-6.9.misc 2024-03-11 09:38:17 -07:00
hostfs hostfs: use d_splice_alias() calling conventions to simplify failure exits 2023-12-21 12:51:00 -05:00
hpfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
hugetlbfs vfs-6.9.misc 2024-03-11 09:38:17 -07:00
iomap vfs-6.9.rw_hint 2024-03-04 18:35:21 +01:00
isofs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
jbd2 jbd2: abort journal when detecting metadata writeback error of fs dev 2024-01-04 23:42:21 -05:00
jffs2 mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
jfs fs,block: yield devices early 2024-03-27 13:17:15 +01:00
kernfs kernfs: annotate different lockdep class for of->mutex of writable files 2024-04-14 06:55:46 -04:00
lockd NFSD 6.9 Release Notes 2024-03-12 14:27:37 -07:00
minix minix: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:32 +01:00
netfs netfs: Fix the pre-flush when appending to a file in writethrough mode 2024-04-26 14:56:18 +02:00
nfs NFS client bugfixes for Linux 6.9 2024-04-29 12:07:37 -07:00
nfs_common
nfsd nfsd-6.9 fixes: 2024-04-29 14:22:24 -07:00
nilfs2 nilfs2: fix OOB in nilfs_set_de_type 2024-04-16 15:39:52 -07:00
nls
notify fanotify: allow freeze when waiting response for permission events 2024-03-07 12:59:51 +01:00
ntfs3 ntfs3: add legacy ntfs file operations 2024-04-23 09:39:07 +02:00
ocfs2 - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min 2024-03-14 18:03:09 -07:00
omfs
openpromfs openpromfs: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:32 +01:00
orangefs Julia Lawall reported this null pointer dereference, this should fix it. 2024-02-14 15:57:53 -05:00
overlayfs ovl: relax WARN_ON in ovl_verify_area() 2024-03-17 15:59:41 +02:00
proc mm: support page_mapcount() on page_has_type() pages 2024-04-24 19:34:26 -07:00
pstore pstore/zone: Don't clear memory twice 2024-03-09 12:33:22 -08:00
qnx4 mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
qnx6 qnx6: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:32 +01:00
quota mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
ramfs ramfs: Initialize security of in-memory inodes 2024-01-26 09:08:16 -08:00
reiserfs fs,block: yield devices early 2024-03-27 13:17:15 +01:00
romfs fs,block: yield devices early 2024-03-27 13:17:15 +01:00
smb smb3: fix lock ordering potential deadlock in cifs_sync_mid_result 2024-04-25 12:49:50 -05:00
squashfs Squashfs: check the inode number is not the invalid value of zero 2024-04-16 15:39:50 -07:00
sysfs fs: sysfs: Fix reference leak in sysfs_break_active_protection() 2024-04-11 15:16:48 +02:00
sysv sysv: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:31 +01:00
tracefs eventfs: Have "events" directory get permissions from its parent 2024-05-04 04:25:37 -04:00
ubifs This pull request contains updates for UBI and UBIFS: 2024-03-21 15:09:29 -07:00
udf mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
ufs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
unicode
vboxsf vboxsf: explicitly deny setlease attempts 2024-04-03 16:06:39 +02:00
verity Networking changes for 6.9. 2024-03-12 17:44:08 -07:00
xfs Bug fixes for 6.9-rc3: 2024-04-06 09:14:18 -07:00
zonefs zonefs: Use str_plural() to fix Coccinelle warning 2024-04-10 07:23:47 +09:00
aio.c aio: Fix null ptr deref in aio_complete() wakeup 2024-04-05 11:20:28 +02:00
anon_inodes.c Merge branch 'kvm-guestmemfd' into HEAD 2023-11-14 08:31:31 -05:00
attr.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
backing-file.c fs: Use KMEM_CACHE instead of kmem_cache_create 2024-02-02 13:11:50 +01:00
bad_inode.c
binfmt_elf_fdpic.c binfmt: replace deprecated strncpy 2024-03-21 20:20:52 -07:00
binfmt_elf_test.c
binfmt_elf.c
binfmt_flat.c
binfmt_misc.c execve updates for v6.7-rc1 2023-10-30 19:28:19 -10:00
binfmt_script.c
buffer.c vfs-6.9.iomap 2024-03-11 10:07:03 -07:00
char_dev.c As usual, lots of singleton and doubleton patches all over the tree and 2023-11-02 20:53:31 -10:00
compat_binfmt_elf.c
coredump.c iov_iter: get rid of 'copy_mc' flag 2024-03-06 10:52:12 +01:00
d_path.c
dax.c fs : Fix warning using plain integer as NULL 2023-11-18 15:00:01 +01:00
dcache.c vfs-6.9.misc 2024-03-11 09:38:17 -07:00
direct-io.c block, fs: Restore the per-bio/request data lifetime fields 2024-02-06 14:31:05 +01:00
drop_caches.c
eventfd.c eventfd: strictly check the count parameter of eventfd_write to avoid inputting illegal strings 2024-02-08 10:12:26 +01:00
eventpoll.c epoll: be better about file lifetimes 2024-05-05 14:00:48 -07:00
exec.c execve fixes for v6.9-rc2 2024-03-27 09:57:30 -07:00
fcntl.c vfs-6.9.iomap 2024-03-11 10:07:03 -07:00
fhandle.c do_sys_name_to_handle(): use kzalloc() to fix kernel-infoleak 2024-01-22 15:33:38 +01:00
file_table.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
file.c file: remove __receive_fd() 2023-12-12 14:24:14 +01:00
filesystems.c
fs_context.c
fs_parser.c __fs_parse: Correct a documentation comment 2024-02-02 13:11:50 +01:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c writeback: move wb_wakeup_delayed defination to fs-writeback.c 2024-01-22 15:33:38 +01:00
fsopen.c
init.c
inode.c bcachefs updates for 6.9 2024-03-15 09:00:09 -07:00
internal.h pidfs: remove config option 2024-03-13 12:53:53 -07:00
ioctl.c fs: Return ENOTTY directly if FS_IOC_GETUUID or FS_IOC_GETFSSYSFSPATH fail 2024-04-09 12:03:49 +02:00
Kconfig - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames 2024-03-14 17:43:30 -07:00
Kconfig.binfmt
kernel_read_file.c
libfs.c pidfs: remove config option 2024-03-13 12:53:53 -07:00
locks.c filelock: fix deadlock detection in POSIX locking 2024-02-20 09:53:33 +01:00
Makefile vfs-6.9.pidfd 2024-03-11 10:21:06 -07:00
mbcache.c vfs: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:31 +01:00
mnt_idmapping.c fs/mnt_idmapping.c: Return -EINVAL when no map is written 2024-02-08 10:12:37 +01:00
mount.h mounts: keep list of mounts in an rbtree 2023-11-18 14:56:16 +01:00
mpage.c block, fs: Restore the per-bio/request data lifetime fields 2024-02-06 14:31:05 +01:00
namei.c security: Place security_path_post_mknod() where the original IMA call was 2024-04-03 10:21:32 -07:00
namespace.c fs: relax mount_setattr() permission checks 2024-02-07 21:16:29 +01:00
nsfs.c pidfs: remove config option 2024-03-13 12:53:53 -07:00
open.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
pidfs.c pidfs: remove config option 2024-03-13 12:53:53 -07:00
pipe.c fs/pipe: Convert to lockdep_cmp_fn 2024-02-02 13:11:49 +01:00
pnode.c mounts: keep list of mounts in an rbtree 2023-11-18 14:56:16 +01:00
pnode.h
posix_acl.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
proc_namespace.c namespace: extract show_path() helper 2023-11-18 14:56:16 +01:00
read_write.c fsnotify: optionally pass access range in file permission hooks 2023-12-12 16:20:02 +01:00
readdir.c fsnotify: optionally pass access range in file permission hooks 2023-12-12 16:20:02 +01:00
remap_range.c remap_range: merge do_clone_file_range() into vfs_clone_file_range() 2024-02-06 17:07:21 +01:00
select.c fs/select: rework stack allocation hack for clang 2024-02-20 09:23:52 +01:00
seq_file.c
signalfd.c
splice.c fs: use splice_copy_file_range() inline helper 2023-12-12 16:20:02 +01:00
stack.c
stat.c vfs-6.8.mount 2024-01-08 10:57:34 -08:00
statfs.c
super.c fs,block: yield devices early 2024-03-27 13:17:15 +01:00
sync.c
sysctls.c fs: Remove the now superfluous sentinel elements from ctl_table array 2023-12-28 04:57:57 -08:00
timerfd.c
userfaultfd.c userfaultfd: use per-vma locks in userfaultfd operations 2024-02-22 15:27:20 -08:00
utimes.c
xattr.c evm: Move to LSM infrastructure 2024-02-15 23:43:47 -05:00