linux/fs
Filipe Manana b850ae1427 Btrfs: fix deadlock between direct IO write and defrag/readpages
If readpages() (triggered by defrag or buffered reads) is called while a
direct IO write is in progress, we have a small time window where we can
deadlock, resulting in traces like the following being generated:

[84723.212993] INFO: task fio:2849 blocked for more than 120 seconds.
[84723.214310]       Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
[84723.215640] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84723.217313] fio        D ffff88023ec75218     0  2849   2835 0x00000000
[84723.218778]  ffff880122dfb6e8 0000000000000092 0000000000000000 ffff88023ec75200
[84723.220458]  ffff88000e05d2c0 ffff880122dfc000 ffff88023ec75200 7fffffffffffffff
[84723.230597]  0000000000000002 ffffffff8147891a ffff880122dfb700 ffffffff8147856a
[84723.232085] Call Trace:
[84723.232625]  [<ffffffff8147891a>] ? bit_wait+0x3c/0x3c
[84723.233529]  [<ffffffff8147856a>] schedule+0x7d/0x95
[84723.234398]  [<ffffffff8147baa3>] schedule_timeout+0x43/0x10b
[84723.235384]  [<ffffffff810f82eb>] ? time_hardirqs_on+0x15/0x28
[84723.236426]  [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf
[84723.237502]  [<ffffffff810af8a3>] ? read_seqcount_begin.constprop.20+0x57/0x6d
[84723.238807]  [<ffffffff8108a09b>] ? trace_hardirqs_on_caller+0x16/0x1ab
[84723.242012]  [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf
[84723.243064]  [<ffffffff810af2ad>] ? timekeeping_get_ns+0xe/0x33
[84723.244116]  [<ffffffff810afa2e>] ? ktime_get+0x41/0x52
[84723.245029]  [<ffffffff81477cff>] io_schedule_timeout+0xb7/0x12b
[84723.245942]  [<ffffffff81477cff>] ? io_schedule_timeout+0xb7/0x12b
[84723.246596]  [<ffffffff81478953>] bit_wait_io+0x39/0x45
[84723.247503]  [<ffffffff81478b93>] __wait_on_bit_lock+0x49/0x8d
[84723.248540]  [<ffffffff8111684f>] __lock_page+0x66/0x68
[84723.249558]  [<ffffffff81081c9b>] ? autoremove_wake_function+0x3a/0x3a
[84723.250844]  [<ffffffff81124a04>] lock_page+0x2c/0x2f
[84723.251871]  [<ffffffff81124afc>] invalidate_inode_pages2_range+0xf5/0x2aa
[84723.253274]  [<ffffffff81117c34>] ? filemap_fdatawait_range+0x12d/0x146
[84723.254757]  [<ffffffff81118191>] ? filemap_fdatawrite_range+0x13/0x15
[84723.256378]  [<ffffffffa05139a2>] btrfs_get_blocks_direct+0x1b0/0x664 [btrfs]
[84723.258556]  [<ffffffff8119e3f9>] ? submit_page_section+0x7b/0x111
[84723.260064]  [<ffffffff8119eb90>] do_blockdev_direct_IO+0x658/0xbdb
[84723.261479]  [<ffffffffa05137f2>] ? btrfs_page_exists_in_range+0x1a9/0x1a9 [btrfs]
[84723.262961]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[84723.264449]  [<ffffffff8119f144>] __blockdev_direct_IO+0x31/0x33
[84723.265614]  [<ffffffff8119f144>] ? __blockdev_direct_IO+0x31/0x33
[84723.266769]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[84723.268264]  [<ffffffffa050935d>] btrfs_direct_IO+0x1b9/0x259 [btrfs]
[84723.270954]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[84723.272465]  [<ffffffff8111878c>] generic_file_direct_write+0xb3/0x128
[84723.273734]  [<ffffffffa051955c>] btrfs_file_write_iter+0x228/0x404 [btrfs]
[84723.275101]  [<ffffffff8116ca6f>] __vfs_write+0x7c/0xa5
[84723.276200]  [<ffffffff8116cfab>] vfs_write+0xa0/0xe4
[84723.277298]  [<ffffffff8116d79d>] SyS_write+0x50/0x7e
[84723.278327]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[84723.279595] INFO: lockdep is turned off.
[84723.379035] INFO: task btrfs:2923 blocked for more than 120 seconds.
[84723.380323]       Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
[84723.381608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84723.383003] btrfs           D ffff88023ed75218     0  2923   2859 0x00000000
[84723.384277]  ffff88001311f860 0000000000000082 ffff88001311f840 ffff88023ed75200
[84723.385748]  ffff88012c6751c0 ffff880013120000 ffff88012042fe68 ffff88012042fe30
[84723.387152]  ffff880221571c88 0000000000000001 ffff88001311f878 ffffffff8147856a
[84723.388620] Call Trace:
[84723.389105]  [<ffffffff8147856a>] schedule+0x7d/0x95
[84723.391882]  [<ffffffffa051da32>] btrfs_start_ordered_extent+0x161/0x1fa [btrfs]
[84723.393718]  [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31
[84723.395659]  [<ffffffffa0522c5b>] __do_contiguous_readpages.constprop.21+0x81/0xdc [btrfs]
[84723.397383]  [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs]
[84723.398852]  [<ffffffffa0522da3>] __extent_readpages.constprop.20+0xed/0x100 [btrfs]
[84723.400561]  [<ffffffff81123f6c>] ? __lru_cache_add+0x5d/0x72
[84723.401787]  [<ffffffffa0523896>] extent_readpages+0x111/0x1a7 [btrfs]
[84723.403121]  [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs]
[84723.404583]  [<ffffffffa05088fa>] btrfs_readpages+0x1f/0x21 [btrfs]
[84723.406007]  [<ffffffff811226df>] __do_page_cache_readahead+0x168/0x1f4
[84723.407502]  [<ffffffff81122988>] ondemand_readahead+0x21d/0x22e
[84723.408937]  [<ffffffff81122988>] ? ondemand_readahead+0x21d/0x22e
[84723.410487]  [<ffffffff81122af1>] page_cache_sync_readahead+0x3d/0x3f
[84723.411710]  [<ffffffffa0535388>] btrfs_defrag_file+0x419/0xaaf [btrfs]
[84723.413007]  [<ffffffffa0531db0>] ? kzalloc+0xf/0x11 [btrfs]
[84723.414085]  [<ffffffffa0535b43>] btrfs_ioctl_defrag+0x125/0x14e [btrfs]
[84723.415307]  [<ffffffffa0536753>] btrfs_ioctl+0x746/0x24c6 [btrfs]
[84723.416532]  [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc
[84723.417731]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
[84723.418699]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
[84723.421532]  [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7
[84723.422629]  [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174
[84723.423712]  [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6
[84723.424801]  [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
[84723.425968]  [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
[84723.427063]  [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
[84723.428138]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f

Consider the following logical and physical file layout:

logical:    ... [ prealloc extent A ] [ prealloc extent B ] [ extent C ] ...
                4K                    8K                    16K

physical:   ... 12853248              12857344              1103101952   ...
                                      (= 12853248 + 4K)

Extents A and B are physically adjacent. The following diagram shows a
sequence of events that lead to the deadlock when we attempt to do a
direct IO write against the file range [4K, 16K[ and a defrag is triggered
simultaneously.

           CPU 1                                               CPU 2

 btrfs_direct_IO()

   btrfs_get_blocks_direct()
     creates ordered extent A, covering
     the 4k prealloc extent A (range [4K, 8K[)

                                                    btrfs_defrag_file()
                                                      page_cache_sync_readahead([0K, 1M[)
                                                        btrfs_readpages()
                                                          extent_readpages()

                                                            locks all pages in the file
                                                            range [0K, 128K[ through calls
                                                            to add_to_page_cache_lru()

                                                            __do_contiguous_readpages()

                                                               finds ordered extent A

                                                               waits for it to complete

   btrfs_get_blocks_direct() called again

     lock_extent_direct(range [8K, 16K[)

       finds a page in range [8K, 16K[ through
       btrfs_page_exists_in_range()

       invalidate_inode_pages2_range([8K, 16K[)

         --> tries to lock pages that are already
             locked by the task at CPU 2

         --> our task, running __blockdev_direct_IO(),
             hangs waiting to lock the pages and the
             submit bio callback, btrfs_submit_direct(),
             ends up never being called, resulting in the
             ordered extent A never completing (because a
             corresponding bio is never submitted) and
             CPU 2 will wait for it forever while holding
             the pages locked
              ---> deadlock!

Fix this by removing the page invalidation approach when attempting to
lock the range for IO from the callback btrfs_get_blocks_direct() and
falling back buffered IO. This was a rare case anyway and well behaved
applications do not mix concurrent direct IO writes with buffered reads
anyway, being a concurrent defrag the only normal case that could lead
to the deadlock.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-17 10:59:50 +00:00
..
9p 9p: fix return code of read() when count is 0 2015-08-23 14:21:36 -05:00
adfs fs/adfs: remove unneeded cast 2015-06-30 19:44:57 -07:00
affs fs/affs: make root lookup from blkdev logical size 2015-09-10 13:29:01 -07:00
afs net: Add a struct net parameter to sock_create_kern 2015-05-11 10:50:17 -04:00
autofs4 make simple_positive() public 2015-06-23 18:02:01 -04:00
befs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-07-04 19:36:06 -07:00
bfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
btrfs Btrfs: fix deadlock between direct IO write and defrag/readpages 2015-12-17 10:59:50 +00:00
cachefiles Merge branch 'fscache-fixes' into for-next 2015-06-23 18:01:30 -04:00
ceph Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client 2015-09-11 12:33:03 -07:00
cifs [CIFS] Update cifs version number 2015-10-03 16:54:17 -05:00
coda fs/coda: fix readlink buffer overflow 2015-09-10 13:29:01 -07:00
configfs configfs: fix kernel infoleak through user-controlled format string 2015-07-17 16:39:53 -07:00
cramfs
debugfs debugfs: Export bool read/write functions 2015-07-20 18:44:50 +01:00
devpts devpts: if initialization failed, don't crash when opening /dev/ptmx 2015-06-30 19:44:58 -07:00
dlm dlm for 4.3 2015-09-03 12:57:48 -07:00
ecryptfs Invalidate stale eCryptfs dcache entries caused by unlinked lower inodes 2015-09-08 11:26:17 -07:00
efivarfs Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-05-06 10:57:37 -07:00
efs fs/efs: femove unneeded cast 2015-06-25 17:00:42 -07:00
exofs pagemap.h: move dir_pages() over there 2015-06-23 18:02:00 -04:00
exportfs VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) 2015-02-22 11:38:41 -05:00
ext2 ext2: huge page fault support 2015-09-08 15:35:28 -07:00
ext4 ext4: start transaction before calling into DAX 2015-09-08 15:35:28 -07:00
f2fs Merge tag 'for-f2fs-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs 2015-09-03 13:10:22 -07:00
fat writeback: separate out include/linux/backing-dev-defs.h 2015-06-02 08:33:34 -06:00
freevxfs freevxfs: Grammar s/an negative/a negative/ 2015-08-07 13:59:24 +02:00
fscache FS-Cache: Retain the netfs context in the retrieval op earlier 2015-04-02 14:28:53 +01:00
fuse fs/fuse: fix ioctl type confusion 2015-08-16 12:35:44 -07:00
gfs2 GFS2: merge window 2015-09-11 12:23:51 -07:00
hfs hfs: fix B-tree corruption after insertion at position 0 2015-09-10 13:29:01 -07:00
hfsplus hfs,hfsplus: cache pages correctly between bnode_create and bnode_free 2015-09-10 13:29:01 -07:00
hostfs fs: create and use seq_show_option for escaping 2015-09-04 16:54:41 -07:00
hpfs hpfs: update ctime and mtime on directory modification 2015-09-03 11:55:30 -07:00
hugetlbfs hugetlbfs: add hugetlbfs_fallocate() 2015-09-08 15:35:28 -07:00
isofs VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
jbd2 jbd2: limit number of reserved credits 2015-08-04 11:21:52 -04:00
jffs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-07-04 19:36:06 -07:00
jfs Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2015-09-03 12:28:30 -07:00
kernfs kernfs: implement kernfs_path_len() 2015-08-18 15:49:15 -07:00
lockd lockd: NLM grace period shouldn't block NFSv4 opens 2015-08-13 10:22:06 -04:00
logfs block: remove bio_get_nr_vecs() 2015-08-13 12:32:04 -06:00
minix Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-07-04 19:36:06 -07:00
ncpfs ncpfs: successful rename() should invalidate caches for parents 2015-06-14 11:31:39 -04:00
nfs NFS: Fix a tracepoint NULL-pointer dereference 2015-10-06 18:56:25 -04:00
nfs_common lockd: NLM grace period shouldn't block NFSv4 opens 2015-08-13 10:22:06 -04:00
nfsd NFS client updates for Linux 4.3 2015-09-07 14:02:24 -07:00
nilfs2 block: remove bio_get_nr_vecs() 2015-08-13 12:32:04 -06:00
nls
notify Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-09-05 20:34:28 -07:00
ntfs ntfs: delete unnecessary checks before calling iput() 2015-09-04 16:54:41 -07:00
ocfs2 ocfs2/dlm: fix deadlock when dispatch assert master 2015-09-22 15:09:53 -07:00
omfs omfs: fix potential integer overflow in allocator 2015-05-28 18:25:19 -07:00
openpromfs
overlayfs fs: create and use seq_show_option for escaping 2015-09-04 16:54:41 -07:00
proc proc: convert to kstrto*()/kstrto*_from_user() 2015-09-10 13:29:01 -07:00
pstore Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2015-07-03 15:20:57 -07:00
qnx4
qnx6 pagemap.h: move dir_pages() over there 2015-06-23 18:02:00 -04:00
quota Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-09-05 20:34:28 -07:00
ramfs VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
reiserfs fs: create and use seq_show_option for escaping 2015-09-04 16:54:41 -07:00
romfs make new_sync_{read,write}() static 2015-04-11 22:29:40 -04:00
squashfs fs: cleanup slight list_entry abuse 2015-06-23 18:01:59 -04:00
sysfs vfs: Commit to never having exectuables on proc and sysfs. 2015-07-10 10:39:25 -05:00
sysv pagemap.h: move dir_pages() over there 2015-06-23 18:02:00 -04:00
tracefs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-07-04 19:36:06 -07:00
ubifs UBIFS: Kill unneeded locking in ubifs_init_security 2015-09-29 12:45:42 +02:00
udf udf: Don't modify filesystem for read-only mounts 2015-08-20 14:58:35 +02:00
ufs fix ufs write vs readpage race when writing into a hole 2015-09-09 10:43:12 -07:00
xfs xfs: huge page fault support 2015-09-08 15:35:28 -07:00
aio.c mm: move ->mremap() from file_operations to vm_operations_struct 2015-09-04 16:54:41 -07:00
anon_inodes.c
attr.c
bad_inode.c don't bother with most of the bad_file_ops methods 2015-02-20 04:03:58 -05:00
binfmt_aout.c
binfmt_elf_fdpic.c
binfmt_elf.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-07-04 19:36:06 -07:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
binfmt_script.c
block_dev.c blockdev: don't set S_DAX for misaligned partitions 2015-09-15 20:08:05 -04:00
buffer.c fs: use helper bio_add_page() instead of open coding on bi_io_vec 2015-08-13 12:32:00 -06:00
char_dev.c fs/char_dev.c: fix incorrect documentation for unregister_chrdev_region 2015-08-05 13:49:35 -07:00
compat_binfmt_elf.c
compat_ioctl.c ioctl_compat: handle FITRIM 2015-07-09 11:42:21 -07:00
compat.c
coredump.c fs: Don't dump core if the corefile would become world-readable. 2015-09-10 13:29:01 -07:00
dax.c dax: fix NULL pointer in __dax_pmd_fault() 2015-10-01 21:42:35 -04:00
dcache.c dcache: Reduce the scope of i_lock in d_splice_alias 2015-08-21 02:34:37 -04:00
dcookies.c
direct-io.c block: remove bio_get_nr_vecs() 2015-08-13 12:32:04 -06:00
drop_caches.c inode: convert inode_sb_list_lock to per-sb 2015-08-17 18:39:46 -04:00
eventfd.c eventfd: don't take the spinlock in eventfd_poll 2015-02-17 14:34:52 -08:00
eventpoll.c epoll: optimize setting task running after blocking 2015-02-13 21:21:40 -08:00
exec.c vfs: Commit to never having exectuables on proc and sysfs. 2015-07-10 10:39:25 -05:00
fcntl.c
fhandle.c vfs: read file_handle only once in handle_to_path 2015-06-02 10:29:07 -07:00
file_table.c fs, file table: reinit files_stat.max_files after deferred memory initialisation 2015-08-07 04:39:40 +03:00
file.c fs/file.c: __fget() and dup2() atomicity rules 2015-07-01 02:31:08 -04:00
filesystems.c
fs_pin.c fs_pin: Allow for the possibility that m_list or s_list go unused. 2015-04-09 11:39:55 -05:00
fs_struct.c
fs-writeback.c fs-writeback: unplug before cond_resched in writeback_sb_inodes 2015-09-19 18:50:19 -07:00
inode.c inode: don't softlockup when evicting inodes 2015-08-18 10:20:09 -07:00
internal.h inode: rename i_wb_list to i_io_list 2015-08-17 23:38:10 -04:00
ioctl.c fsioctl.c: make generic_block_fiemap() signal-tolerant 2015-02-10 14:30:30 -08:00
Kconfig fs: Remove ext3 filesystem driver 2015-07-23 20:59:40 +02:00
Kconfig.binfmt mm: split ET_DYN ASLR from mmap ASLR 2015-04-14 16:49:05 -07:00
libfs.c fs: Set the size of empty dirs to 0. 2015-08-12 15:28:45 -05:00
locks.c fs: fix fs/locks.c kernel-doc warning 2015-08-31 16:27:25 -04:00
Makefile userfaultfd: buildsystem activation 2015-09-04 16:54:41 -07:00
mbcache.c
mount.h fs: use seq_open_private() for proc_mounts 2015-06-30 19:44:56 -07:00
mpage.c block: remove bio_get_nr_vecs() 2015-08-13 12:32:04 -06:00
namei.c namei: results of d_is_negative() should be checked after dentry revalidation 2015-10-10 10:17:27 -07:00
namespace.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2015-09-01 16:13:25 -07:00
no-block.c
nsfs.c fs/seq_file: convert int seq_vprint/seq_printf/etc... returns to void 2015-09-11 15:21:34 -07:00
open.c vfs: Commit to never having exectuables on proc and sysfs. 2015-07-10 10:39:25 -05:00
pipe.c VFS: assorted weird filesystems: d_inode() annotations 2015-04-15 15:06:58 -04:00
pnode.c mnt: Don't propagate unmounts to locked mounts 2015-04-02 20:34:20 -05:00
pnode.h mnt: Clarify and correct the disconnect logic in umount_tree 2015-07-22 20:33:27 -05:00
posix_acl.c fs/posix_acl.c: make posix_acl_create() safer and cleaner 2015-06-23 18:01:07 -04:00
proc_namespace.c fs: use seq_open_private() for proc_mounts 2015-06-30 19:44:56 -07:00
read_write.c new_sync_write(): discard ->ki_pos unless the return value is positive 2015-04-11 22:29:46 -04:00
readdir.c
select.c locking/arch: Rename set_mb() to smp_store_mb() 2015-05-19 08:32:00 +02:00
seq_file.c fs/seq_file: convert int seq_vprint/seq_printf/etc... returns to void 2015-09-11 15:21:34 -07:00
signalfd.c signalfd: fix information leak in signalfd_copyinfo 2015-08-07 04:39:40 +03:00
splice.c Merge branch 'akpm' (patches from Andrew) 2015-06-24 20:47:21 -07:00
stack.c
stat.c VFS: assorted d_backing_inode() annotations 2015-04-15 15:06:59 -04:00
statfs.c
super.c Merge branch 'superblock-scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into for-next 2015-08-21 02:31:20 -04:00
sync.c vfs: add support for a lazytime mount option 2015-02-05 02:45:00 -05:00
timerfd.c
userfaultfd.c userfaultfd: revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key" 2015-09-22 15:09:53 -07:00
utimes.c
xattr.c evm: fix potential race when removing xattrs 2015-05-21 13:28:47 -04:00