linux/fs
Filipe Manana 50460e3718 Btrfs: fix race when finishing dev replace leading to transaction abort
During the final phase of a device replace operation, I ran into a
transaction abort that resulted in the following trace:

[23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843 btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]()
[23919.664742] BTRFS: Transaction aborted (error -2)
[23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata e1000 virtio scsi_mod [last unloaded: btrfs]
[23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted 4.3.0-rc5-btrfs-next-17+ #1
[23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[23919.689151]  0000000000000000 ffff8804020cbb50 ffffffff812566f4 ffff8804020cbb98
[23919.692604]  ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69 ffff88041b678a48
[23919.694230]  ffff88042ac38000 ffff88041b678930 00000000fffffffe ffff8804020cbbf0
[23919.696716] Call Trace:
[23919.698669]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
[23919.700597]  [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[23919.701958]  [<ffffffffa03eea69>] ? btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
[23919.703612]  [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
[23919.705047]  [<ffffffffa03eea69>] btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
[23919.706967]  [<ffffffffa0402097>] __btrfs_end_transaction+0x84/0x2dd [btrfs]
[23919.708611]  [<ffffffffa0402300>] btrfs_end_transaction+0x10/0x12 [btrfs]
[23919.710099]  [<ffffffffa03ef0b8>] btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs]
[23919.711970]  [<ffffffffa0413025>] btrfs_fallocate+0x7d3/0xc6d [btrfs]
[23919.713602]  [<ffffffff8108b78f>] ? lock_acquire+0x10d/0x194
[23919.714756]  [<ffffffff81086dbc>] ? percpu_down_read+0x51/0x78
[23919.716155]  [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
[23919.718918]  [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
[23919.724170]  [<ffffffff8116b579>] vfs_fallocate+0x170/0x1ff
[23919.725482]  [<ffffffff8117c1d7>] ioctl_preallocate+0x89/0x9b
[23919.726790]  [<ffffffff8117c5ef>] do_vfs_ioctl+0x406/0x4e6
[23919.728428]  [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
[23919.729642]  [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
[23919.730782]  [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
[23919.731847]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[23919.733330] ---[ end trace 166ef301a335832a ]---

This is due to a race between device replace and chunk allocation, which
the following diagram illustrates:

         CPU 1                                    CPU 2

 btrfs_dev_replace_finishing()

   at this point
    dev_replace->tgtdev->devid ==
    BTRFS_DEV_REPLACE_DEVID (0ULL)

   ...

   btrfs_start_transaction()
   btrfs_commit_transaction()

                                               btrfs_fallocate()
                                                 btrfs_alloc_data_chunk_ondemand()
                                                   btrfs_join_transaction()
                                                     --> starts a new transaction
                                                   do_chunk_alloc()
                                                     lock fs_info->chunk_mutex
                                                       btrfs_alloc_chunk()
                                                         --> creates extent map for
                                                             the new chunk with
                                                             em->bdev->map->stripes[i]->dev->devid
                                                             == X (X > 0)
                                                         --> extent map is added to
                                                             fs_info->mapping_tree
                                                         --> initial phase of bg A
                                                             allocation completes
                                                     unlock fs_info->chunk_mutex

   lock fs_info->chunk_mutex

   btrfs_dev_replace_update_device_in_mapping_tree()
     --> iterates fs_info->mapping_tree and
         replaces the device in every extent
         map's map->stripes[] with
         dev_replace->tgtdev, which still has
         an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)

                                                   btrfs_end_transaction()
                                                     btrfs_create_pending_block_groups()
                                                       --> starts final phase of
                                                           bg A creation (update device,
                                                           extent, and chunk trees, etc)
                                                       btrfs_finish_chunk_alloc()

                                                         btrfs_update_device()
                                                           --> attempts to update a device
                                                               item with ID == 0ULL
                                                               (BTRFS_DEV_REPLACE_DEVID)
                                                               which is the current ID of
                                                               bg A's
                                                               em->bdev->map->stripes[i]->dev->devid
                                                           --> doesn't find such item
                                                               returns -ENOENT
                                                           --> the device id should have been X
                                                               and not 0ULL

                                                       got -ENOENT from
                                                       btrfs_finish_chunk_alloc()
                                                       and aborts current transaction

   finishes setting up the target device,
   namely it sets tgtdev->devid to the value
   of srcdev->devid, which is X (and X > 0)

   frees the srcdev

   unlock fs_info->chunk_mutex

So fix this by taking the device list mutex when processing the chunk's
extent map stripes to update the device items. This avoids getting the
wrong device id and use-after-free problems if the task finishing a
chunk allocation grabs the replaced device, which is freed while the
dev replace task is holding the device list mutex.

This happened while running fstest btrfs/071.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-12-17 10:59:46 +00:00
..
9p 9p: fix return code of read() when count is 0 2015-08-23 14:21:36 -05:00
adfs fs/adfs: remove unneeded cast 2015-06-30 19:44:57 -07:00
affs fs/affs: make root lookup from blkdev logical size 2015-09-10 13:29:01 -07:00
afs net: Add a struct net parameter to sock_create_kern 2015-05-11 10:50:17 -04:00
autofs4 make simple_positive() public 2015-06-23 18:02:01 -04:00
befs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-07-04 19:36:06 -07:00
bfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
btrfs Btrfs: fix race when finishing dev replace leading to transaction abort 2015-12-17 10:59:46 +00:00
cachefiles Merge branch 'fscache-fixes' into for-next 2015-06-23 18:01:30 -04:00
ceph Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client 2015-09-11 12:33:03 -07:00
cifs [CIFS] Update cifs version number 2015-10-03 16:54:17 -05:00
coda fs/coda: fix readlink buffer overflow 2015-09-10 13:29:01 -07:00
configfs configfs: fix kernel infoleak through user-controlled format string 2015-07-17 16:39:53 -07:00
cramfs
debugfs debugfs: Export bool read/write functions 2015-07-20 18:44:50 +01:00
devpts devpts: if initialization failed, don't crash when opening /dev/ptmx 2015-06-30 19:44:58 -07:00
dlm dlm for 4.3 2015-09-03 12:57:48 -07:00
ecryptfs Invalidate stale eCryptfs dcache entries caused by unlinked lower inodes 2015-09-08 11:26:17 -07:00
efivarfs Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-05-06 10:57:37 -07:00
efs fs/efs: femove unneeded cast 2015-06-25 17:00:42 -07:00
exofs pagemap.h: move dir_pages() over there 2015-06-23 18:02:00 -04:00
exportfs VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) 2015-02-22 11:38:41 -05:00
ext2 ext2: huge page fault support 2015-09-08 15:35:28 -07:00
ext4 ext4: start transaction before calling into DAX 2015-09-08 15:35:28 -07:00
f2fs Merge tag 'for-f2fs-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs 2015-09-03 13:10:22 -07:00
fat writeback: separate out include/linux/backing-dev-defs.h 2015-06-02 08:33:34 -06:00
freevxfs freevxfs: Grammar s/an negative/a negative/ 2015-08-07 13:59:24 +02:00
fscache FS-Cache: Retain the netfs context in the retrieval op earlier 2015-04-02 14:28:53 +01:00
fuse fs/fuse: fix ioctl type confusion 2015-08-16 12:35:44 -07:00
gfs2 GFS2: merge window 2015-09-11 12:23:51 -07:00
hfs hfs: fix B-tree corruption after insertion at position 0 2015-09-10 13:29:01 -07:00
hfsplus hfs,hfsplus: cache pages correctly between bnode_create and bnode_free 2015-09-10 13:29:01 -07:00
hostfs fs: create and use seq_show_option for escaping 2015-09-04 16:54:41 -07:00
hpfs hpfs: update ctime and mtime on directory modification 2015-09-03 11:55:30 -07:00
hugetlbfs hugetlbfs: add hugetlbfs_fallocate() 2015-09-08 15:35:28 -07:00
isofs VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
jbd2 jbd2: limit number of reserved credits 2015-08-04 11:21:52 -04:00
jffs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-07-04 19:36:06 -07:00
jfs Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2015-09-03 12:28:30 -07:00
kernfs kernfs: implement kernfs_path_len() 2015-08-18 15:49:15 -07:00
lockd lockd: NLM grace period shouldn't block NFSv4 opens 2015-08-13 10:22:06 -04:00
logfs block: remove bio_get_nr_vecs() 2015-08-13 12:32:04 -06:00
minix Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-07-04 19:36:06 -07:00
ncpfs ncpfs: successful rename() should invalidate caches for parents 2015-06-14 11:31:39 -04:00
nfs NFS: Fix a tracepoint NULL-pointer dereference 2015-10-06 18:56:25 -04:00
nfs_common lockd: NLM grace period shouldn't block NFSv4 opens 2015-08-13 10:22:06 -04:00
nfsd NFS client updates for Linux 4.3 2015-09-07 14:02:24 -07:00
nilfs2 block: remove bio_get_nr_vecs() 2015-08-13 12:32:04 -06:00
nls
notify Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-09-05 20:34:28 -07:00
ntfs ntfs: delete unnecessary checks before calling iput() 2015-09-04 16:54:41 -07:00
ocfs2 ocfs2/dlm: fix deadlock when dispatch assert master 2015-09-22 15:09:53 -07:00
omfs omfs: fix potential integer overflow in allocator 2015-05-28 18:25:19 -07:00
openpromfs
overlayfs fs: create and use seq_show_option for escaping 2015-09-04 16:54:41 -07:00
proc proc: convert to kstrto*()/kstrto*_from_user() 2015-09-10 13:29:01 -07:00
pstore Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2015-07-03 15:20:57 -07:00
qnx4
qnx6 pagemap.h: move dir_pages() over there 2015-06-23 18:02:00 -04:00
quota Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-09-05 20:34:28 -07:00
ramfs VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
reiserfs fs: create and use seq_show_option for escaping 2015-09-04 16:54:41 -07:00
romfs make new_sync_{read,write}() static 2015-04-11 22:29:40 -04:00
squashfs fs: cleanup slight list_entry abuse 2015-06-23 18:01:59 -04:00
sysfs vfs: Commit to never having exectuables on proc and sysfs. 2015-07-10 10:39:25 -05:00
sysv pagemap.h: move dir_pages() over there 2015-06-23 18:02:00 -04:00
tracefs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-07-04 19:36:06 -07:00
ubifs UBIFS: Kill unneeded locking in ubifs_init_security 2015-09-29 12:45:42 +02:00
udf udf: Don't modify filesystem for read-only mounts 2015-08-20 14:58:35 +02:00
ufs fix ufs write vs readpage race when writing into a hole 2015-09-09 10:43:12 -07:00
xfs xfs: huge page fault support 2015-09-08 15:35:28 -07:00
aio.c mm: move ->mremap() from file_operations to vm_operations_struct 2015-09-04 16:54:41 -07:00
anon_inodes.c
attr.c
bad_inode.c don't bother with most of the bad_file_ops methods 2015-02-20 04:03:58 -05:00
binfmt_aout.c
binfmt_elf_fdpic.c
binfmt_elf.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-07-04 19:36:06 -07:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
binfmt_script.c
block_dev.c blockdev: don't set S_DAX for misaligned partitions 2015-09-15 20:08:05 -04:00
buffer.c fs: use helper bio_add_page() instead of open coding on bi_io_vec 2015-08-13 12:32:00 -06:00
char_dev.c fs/char_dev.c: fix incorrect documentation for unregister_chrdev_region 2015-08-05 13:49:35 -07:00
compat_binfmt_elf.c
compat_ioctl.c ioctl_compat: handle FITRIM 2015-07-09 11:42:21 -07:00
compat.c
coredump.c fs: Don't dump core if the corefile would become world-readable. 2015-09-10 13:29:01 -07:00
dax.c dax: fix NULL pointer in __dax_pmd_fault() 2015-10-01 21:42:35 -04:00
dcache.c dcache: Reduce the scope of i_lock in d_splice_alias 2015-08-21 02:34:37 -04:00
dcookies.c
direct-io.c block: remove bio_get_nr_vecs() 2015-08-13 12:32:04 -06:00
drop_caches.c inode: convert inode_sb_list_lock to per-sb 2015-08-17 18:39:46 -04:00
eventfd.c eventfd: don't take the spinlock in eventfd_poll 2015-02-17 14:34:52 -08:00
eventpoll.c epoll: optimize setting task running after blocking 2015-02-13 21:21:40 -08:00
exec.c vfs: Commit to never having exectuables on proc and sysfs. 2015-07-10 10:39:25 -05:00
fcntl.c
fhandle.c vfs: read file_handle only once in handle_to_path 2015-06-02 10:29:07 -07:00
file_table.c fs, file table: reinit files_stat.max_files after deferred memory initialisation 2015-08-07 04:39:40 +03:00
file.c fs/file.c: __fget() and dup2() atomicity rules 2015-07-01 02:31:08 -04:00
filesystems.c
fs_pin.c fs_pin: Allow for the possibility that m_list or s_list go unused. 2015-04-09 11:39:55 -05:00
fs_struct.c
fs-writeback.c fs-writeback: unplug before cond_resched in writeback_sb_inodes 2015-09-19 18:50:19 -07:00
inode.c inode: don't softlockup when evicting inodes 2015-08-18 10:20:09 -07:00
internal.h inode: rename i_wb_list to i_io_list 2015-08-17 23:38:10 -04:00
ioctl.c fsioctl.c: make generic_block_fiemap() signal-tolerant 2015-02-10 14:30:30 -08:00
Kconfig fs: Remove ext3 filesystem driver 2015-07-23 20:59:40 +02:00
Kconfig.binfmt mm: split ET_DYN ASLR from mmap ASLR 2015-04-14 16:49:05 -07:00
libfs.c fs: Set the size of empty dirs to 0. 2015-08-12 15:28:45 -05:00
locks.c fs: fix fs/locks.c kernel-doc warning 2015-08-31 16:27:25 -04:00
Makefile userfaultfd: buildsystem activation 2015-09-04 16:54:41 -07:00
mbcache.c
mount.h fs: use seq_open_private() for proc_mounts 2015-06-30 19:44:56 -07:00
mpage.c block: remove bio_get_nr_vecs() 2015-08-13 12:32:04 -06:00
namei.c namei: results of d_is_negative() should be checked after dentry revalidation 2015-10-10 10:17:27 -07:00
namespace.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2015-09-01 16:13:25 -07:00
no-block.c
nsfs.c fs/seq_file: convert int seq_vprint/seq_printf/etc... returns to void 2015-09-11 15:21:34 -07:00
open.c vfs: Commit to never having exectuables on proc and sysfs. 2015-07-10 10:39:25 -05:00
pipe.c VFS: assorted weird filesystems: d_inode() annotations 2015-04-15 15:06:58 -04:00
pnode.c mnt: Don't propagate unmounts to locked mounts 2015-04-02 20:34:20 -05:00
pnode.h mnt: Clarify and correct the disconnect logic in umount_tree 2015-07-22 20:33:27 -05:00
posix_acl.c fs/posix_acl.c: make posix_acl_create() safer and cleaner 2015-06-23 18:01:07 -04:00
proc_namespace.c fs: use seq_open_private() for proc_mounts 2015-06-30 19:44:56 -07:00
read_write.c new_sync_write(): discard ->ki_pos unless the return value is positive 2015-04-11 22:29:46 -04:00
readdir.c
select.c locking/arch: Rename set_mb() to smp_store_mb() 2015-05-19 08:32:00 +02:00
seq_file.c fs/seq_file: convert int seq_vprint/seq_printf/etc... returns to void 2015-09-11 15:21:34 -07:00
signalfd.c signalfd: fix information leak in signalfd_copyinfo 2015-08-07 04:39:40 +03:00
splice.c Merge branch 'akpm' (patches from Andrew) 2015-06-24 20:47:21 -07:00
stack.c
stat.c VFS: assorted d_backing_inode() annotations 2015-04-15 15:06:59 -04:00
statfs.c
super.c Merge branch 'superblock-scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into for-next 2015-08-21 02:31:20 -04:00
sync.c vfs: add support for a lazytime mount option 2015-02-05 02:45:00 -05:00
timerfd.c
userfaultfd.c userfaultfd: revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key" 2015-09-22 15:09:53 -07:00
utimes.c
xattr.c evm: fix potential race when removing xattrs 2015-05-21 13:28:47 -04:00