linux/fs
Dave Chinner 348a1983cf xfs: fix unlink vs cluster buffer instantiation race
Luis has been reporting an assert failure when freeing an inode
cluster during inode inactivation for a while. The assert looks
like:

 XFS: Assertion failed: bp->b_flags & XBF_DONE, file: fs/xfs/xfs_trans_buf.c, line: 241
 ------------[ cut here ]------------
 kernel BUG at fs/xfs/xfs_message.c:102!
 Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
 CPU: 4 PID: 73 Comm: kworker/4:1 Not tainted 6.10.0-rc1 #4
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 Workqueue: xfs-inodegc/loop5 xfs_inodegc_worker [xfs]
 RIP: 0010:assfail (fs/xfs/xfs_message.c:102) xfs
 RSP: 0018:ffff88810188f7f0 EFLAGS: 00010202
 RAX: 0000000000000000 RBX: ffff88816e748250 RCX: 1ffffffff844b0e7
 RDX: 0000000000000004 RSI: ffff88810188f558 RDI: ffffffffc2431fa0
 RBP: 1ffff11020311f01 R08: 0000000042431f9f R09: ffffed1020311e9b
 R10: ffff88810188f4df R11: ffffffffac725d70 R12: ffff88817a3f4000
 R13: ffff88812182f000 R14: ffff88810188f998 R15: ffffffffc2423f80
 FS:  0000000000000000(0000) GS:ffff8881c8400000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000055fe9d0f109c CR3: 000000014426c002 CR4: 0000000000770ef0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  <TASK>
 xfs_trans_read_buf_map (fs/xfs/xfs_trans_buf.c:241 (discriminator 1)) xfs
 xfs_imap_to_bp (fs/xfs/xfs_trans.h:210 fs/xfs/libxfs/xfs_inode_buf.c:138) xfs
 xfs_inode_item_precommit (fs/xfs/xfs_inode_item.c:145) xfs
 xfs_trans_run_precommits (fs/xfs/xfs_trans.c:931) xfs
 __xfs_trans_commit (fs/xfs/xfs_trans.c:966) xfs
 xfs_inactive_ifree (fs/xfs/xfs_inode.c:1811) xfs
 xfs_inactive (fs/xfs/xfs_inode.c:2013) xfs
 xfs_inodegc_worker (fs/xfs/xfs_icache.c:1841 fs/xfs/xfs_icache.c:1886) xfs
 process_one_work (kernel/workqueue.c:3231)
 worker_thread (kernel/workqueue.c:3306 (discriminator 2) kernel/workqueue.c:3393 (discriminator 2))
 kthread (kernel/kthread.c:389)
 ret_from_fork (arch/x86/kernel/process.c:147)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
  </TASK>

And occurs when the the inode precommit handlers is attempt to look
up the inode cluster buffer to attach the inode for writeback.

The trail of logic that I can reconstruct is as follows.

	1. the inode is clean when inodegc runs, so it is not
	   attached to a cluster buffer when precommit runs.

	2. #1 implies the inode cluster buffer may be clean and not
	   pinned by dirty inodes when inodegc runs.

	3. #2 implies that the inode cluster buffer can be reclaimed
	   by memory pressure at any time.

	4. The assert failure implies that the cluster buffer was
	   attached to the transaction, but not marked done. It had
	   been accessed earlier in the transaction, but not marked
	   done.

	5. #4 implies the cluster buffer has been invalidated (i.e.
	   marked stale).

	6. #5 implies that the inode cluster buffer was instantiated
	   uninitialised in the transaction in xfs_ifree_cluster(),
	   which only instantiates the buffers to invalidate them
	   and never marks them as done.

Given factors 1-3, this issue is highly dependent on timing and
environmental factors. Hence the issue can be very difficult to
reproduce in some situations, but highly reliable in others. Luis
has an environment where it can be reproduced easily by g/531 but,
OTOH, I've reproduced it only once in ~2000 cycles of g/531.

I think the fix is to have xfs_ifree_cluster() set the XBF_DONE flag
on the cluster buffers, even though they may not be initialised. The
reasons why I think this is safe are:

	1. A buffer cache lookup hit on a XBF_STALE buffer will
	   clear the XBF_DONE flag. Hence all future users of the
	   buffer know they have to re-initialise the contents
	   before use and mark it done themselves.

	2. xfs_trans_binval() sets the XFS_BLI_STALE flag, which
	   means the buffer remains locked until the journal commit
	   completes and the buffer is unpinned. Hence once marked
	   XBF_STALE/XFS_BLI_STALE by xfs_ifree_cluster(), the only
	   context that can access the freed buffer is the currently
	   running transaction.

	3. #2 implies that future buffer lookups in the currently
	   running transaction will hit the transaction match code
	   and not the buffer cache. Hence XBF_STALE and
	   XFS_BLI_STALE will not be cleared unless the transaction
	   initialises and logs the buffer with valid contents
	   again. At which point, the buffer will be marked marked
	   XBF_DONE again, so having XBF_DONE already set on the
	   stale buffer is a moot point.

	4. #2 also implies that any concurrent access to that
	   cluster buffer will block waiting on the buffer lock
	   until the inode cluster has been fully freed and is no
	   longer an active inode cluster buffer.

	5. #4 + #1 means that any future user of the disk range of
	   that buffer will always see the range of disk blocks
	   covered by the cluster buffer as not done, and hence must
	   initialise the contents themselves.

	6. Setting XBF_DONE in xfs_ifree_cluster() then means the
	   unlinked inode precommit code will see a XBF_DONE buffer
	   from the transaction match as it expects. It can then
	   attach the stale but newly dirtied inode to the stale
	   but newly dirtied cluster buffer without unexpected
	   failures. The stale buffer will then sail through the
	   journal and do the right thing with the attached stale
	   inode during unpin.

Hence the fix is just one line of extra code. The explanation of
why we have to set XBF_DONE in xfs_ifree_cluster, OTOH, is long and
complex....

Fixes: 82842fee6e ("xfs: fix AGF vs inode cluster buffer deadlock")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-06-17 11:17:09 +05:30
..
9p Two fixes headed to stable trees: 2024-05-29 09:25:15 -07:00
adfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
affs affs: remove SLAB_MEM_SPREAD flag usage 2024-02-26 11:36:28 +01:00
afs netfs, 9p: Fix race between umount and async request completion 2024-05-27 13:12:13 +02:00
autofs dcache stuff for this cycle 2024-01-11 20:11:35 -08:00
bcachefs bcachefs: Fix rcu_read_lock() leak in drop_extra_replicas 2024-06-11 18:59:08 -04:00
befs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
bfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
btrfs for-6.10-rc2-tag 2024-06-07 15:13:12 -07:00
cachefiles cachefiles: remove unneeded include of <linux/fdtable.h> 2024-06-03 15:39:17 +02:00
ceph We have a series from Xiubo that adds support for additional access 2024-05-25 14:23:58 -07:00
coda mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
configfs
cramfs use ->bd_mapping instead of ->bd_inode->i_mapping 2024-05-03 02:36:51 -04:00
crypto The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
debugfs debugfs: continue to ignore unknown mount options 2024-05-28 14:32:42 +02:00
devpts fs: Remove the now superfluous sentinel elements from ctl_table array 2023-12-28 04:57:57 -08:00
dlm dlm: return -ENOMEM if ls_recover_buf fails 2024-04-23 16:08:55 -05:00
ecryptfs hardening updates for 6.10-rc1 2024-05-13 14:14:05 -07:00
efivarfs efi: Clear up misconceptions about a maximum variable name size 2024-04-13 10:33:02 +02:00
efs efs: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:33 +01:00
erofs Changes since last update: 2024-05-24 09:31:50 -07:00
exfat exfat: zero the reserved fields of file and stream extension dentries 2024-04-25 21:59:59 +09:00
exportfs fs: Create a generic is_dot_dotdot() utility 2024-01-23 10:58:56 -05:00
ext2 ext2: Remove LEGACY_DIRECT_IO dependency 2024-05-03 11:50:28 +02:00
ext4 bd_inode series 2024-05-21 09:51:42 -07:00
f2fs f2fs update for 6.10-rc1 2024-05-20 13:23:43 -07:00
fat fs: add kernel-doc comments to fat_parse_long() 2024-04-25 21:07:02 -07:00
freevxfs freevxfs: Convert freevxfs to the new mount API. 2024-03-26 09:04:53 +01:00
fuse virtio: features, fixes, cleanups 2024-05-23 12:04:36 -07:00
gfs2 bd_inode series 2024-05-21 09:51:42 -07:00
hfs hfs: really remove hfs_writepage 2023-12-29 11:58:34 -08:00
hfsplus hfsplus: refactor copy_name to not use strncpy 2024-04-24 16:55:28 -07:00
hostfs hostfs: use d_splice_alias() calling conventions to simplify failure exits 2023-12-21 12:51:00 -05:00
hpfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
hugetlbfs The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
iomap iomap: Fix iomap_adjust_read_range for plen calculation 2024-06-05 17:27:03 +02:00
isofs isofs: Use *-y instead of *-objs in Makefile 2024-05-09 18:09:57 +02:00
jbd2 bd_inode series 2024-05-21 09:51:42 -07:00
jffs2 This pull request contains the following changes for JFFS2: 2024-05-25 13:23:42 -07:00
jfs jfs: xattr: fix buffer overflow for invalid xattr 2024-06-04 18:09:03 +02:00
kernfs kernfs: mount: Remove unnecessary ‘NULL’ values from knparent 2024-05-04 19:02:39 +02:00
lockd lockd: host: Remove unnecessary statements'host = NULL;' 2024-05-06 09:07:20 -04:00
minix minix: convert minix to use the new mount api 2024-03-26 09:04:55 +01:00
netfs vfs-6.10-rc2.fixes 2024-05-27 08:09:12 -07:00
nfs NFS client bugfixes for Linux 6.10 2024-06-13 11:07:32 -07:00
nfs_common
nfsd tracing/treewide: Remove second parameter of __assign_str() 2024-05-22 20:14:47 -04:00
nilfs2 nilfs2: fix nilfs_empty_dir() misjudgment and long loop on I/O errors 2024-06-05 19:19:27 -07:00
nls
notify Revert "fanotify: remove unneeded sub-zero check for unsigned value" 2024-05-20 12:43:58 -07:00
ntfs3 driver ntfs3 for linux 6.10 2024-05-25 14:19:01 -07:00
ocfs2 tracing/treewide: Remove second parameter of __assign_str() 2024-05-22 20:14:47 -04:00
omfs
openpromfs openpromfs: finish conversion to the new mount API 2024-03-26 09:04:54 +01:00
orangefs orangefs: fix out-of-bounds fsid access 2024-05-14 17:44:14 -07:00
overlayfs overlayfs update for 6.10 2024-05-22 09:23:18 -07:00
proc mm/ksm: fix ksm_zero_pages accounting 2024-06-05 19:19:26 -07:00
pstore pstore/zone: Don't clear memory twice 2024-03-09 12:33:22 -08:00
qnx4 mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
qnx6 qnx6: convert qnx6 to use the new mount api 2024-03-26 09:04:53 +01:00
quota quota: fix to propagate error of mark_dquot_dirty() to caller 2024-04-12 14:52:29 +02:00
ramfs mm: switch mm->get_unmapped_area() to a flag 2024-04-25 20:56:25 -07:00
reiserfs getting rid of bogus set_blocksize() uses, switching it 2024-05-21 08:34:51 -07:00
romfs fs,block: yield devices early 2024-03-27 13:17:15 +01:00
smb ksmbd: fix missing use of get_write in in smb2_set_ea() 2024-06-11 23:43:09 -05:00
squashfs Mainly singleton patches, documented in their respective changelogs. 2024-05-19 14:02:03 -07:00
sysfs Merge 6.9-rc5 into driver-core-next 2024-04-23 13:27:43 +02:00
sysv sysv: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:31 +01:00
tracefs eventfs: Do not use attributes for events directory 2024-05-23 09:31:50 -04:00
ubifs This pull request contains updates for UBI and UBIFS: 2024-03-21 15:09:29 -07:00
udf udf: Use a folio in udf_write_end() 2024-04-23 15:37:02 +02:00
ufs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
unicode kbuild: use $(src) instead of $(srctree)/$(src) for source directory 2024-05-10 04:34:52 +09:00
vboxsf vboxsf: explicitly deny setlease attempts 2024-04-03 16:06:39 +02:00
verity fsverity: use register_sysctl_init() to avoid kmemleak warning 2024-05-03 08:30:58 -07:00
xfs xfs: fix unlink vs cluster buffer instantiation race 2024-06-17 11:17:09 +05:30
zonefs zonefs: Use str_plural() to fix Coccinelle warning 2024-04-10 07:23:47 +09:00
aio.c Assorted commits that had missed the last merge window... 2024-05-21 13:11:44 -07:00
anon_inodes.c fs: Create anon_inode_getfile_fmode() 2024-04-26 10:33:05 +02:00
attr.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
backing-file.c ovl: implement tmpfile 2024-05-02 20:35:57 +02:00
bad_inode.c
binfmt_elf_fdpic.c binfmt_elf_fdpic: fix /proc/<pid>/auxv 2024-04-24 15:55:28 -07:00
binfmt_elf_test.c
binfmt_elf.c Mainly singleton patches, documented in their respective changelogs. 2024-05-19 14:02:03 -07:00
binfmt_flat.c
binfmt_misc.c
binfmt_script.c
buffer.c bd_inode series 2024-05-21 09:51:42 -07:00
char_dev.c
compat_binfmt_elf.c
coredump.c virtio: features, fixes, cleanups 2024-05-23 12:04:36 -07:00
d_path.c
dax.c dax: use huge_zero_folio 2024-04-25 20:56:20 -07:00
dcache.c Revert "vfs: Delete the associated dentry when deleting a file" 2024-05-29 09:39:34 -07:00
direct-io.c fs/direct-io: remove redundant assignment to variable retval 2024-04-11 10:21:24 +02:00
drop_caches.c
eventfd.c eventfd: strictly check the count parameter of eventfd_write to avoid inputting illegal strings 2024-02-08 10:12:26 +01:00
eventpoll.c epoll: be better about file lifetimes 2024-05-05 14:00:48 -07:00
exec.c The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
fcntl.c fcntl: add F_DUPFD_QUERY fcntl() 2024-05-10 08:26:31 +02:00
fhandle.c fs: Annotate struct file_handle with __counted_by() and use struct_size() 2024-04-05 15:53:47 +02:00
file_table.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
file.c fs/file: fix the check in find_next_fd() 2024-05-30 09:11:47 +02:00
filesystems.c
fs_context.c
fs_parser.c __fs_parse: Correct a documentation comment 2024-02-02 13:11:50 +01:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c fs/writeback: remove unnecessary return in writeback_inodes_sb 2024-04-05 15:53:45 +02:00
fsopen.c
init.c
inode.c bcachefs updates for 6.9 2024-03-15 09:00:09 -07:00
internal.h ovl: implement tmpfile 2024-05-02 20:35:57 +02:00
ioctl.c fs/ioctl: Add a comment to keep the logic in sync with LSM policies 2024-05-13 06:58:35 +02:00
Kconfig - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames 2024-03-14 17:43:30 -07:00
Kconfig.binfmt
kernel_read_file.c
libfs.c shmem: Fix shmem_rename2() 2024-04-17 13:49:44 +02:00
locks.c filelock: fix deadlock detection in POSIX locking 2024-02-20 09:53:33 +01:00
Makefile vfs-6.9.pidfd 2024-03-11 10:21:06 -07:00
mbcache.c vfs: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:31 +01:00
mnt_idmapping.c fs/mnt_idmapping.c: Return -EINVAL when no map is written 2024-02-08 10:12:37 +01:00
mount.h
mpage.c block, fs: Restore the per-bio/request data lifetime fields 2024-02-06 14:31:05 +01:00
namei.c overlayfs update for 6.10 2024-05-22 09:23:18 -07:00
namespace.c fs: relax mount_setattr() permission checks 2024-02-07 21:16:29 +01:00
nsfs.c pidfs: remove config option 2024-03-13 12:53:53 -07:00
open.c do_dentry_open(): kill inode argument 2024-04-15 16:03:25 -04:00
pidfs.c fs/pidfs: make 'lsof' happy with our inode changes 2024-05-21 08:08:00 -07:00
pipe.c fs/pipe: Convert to lockdep_cmp_fn 2024-02-02 13:11:49 +01:00
pnode.c
pnode.h
posix_acl.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
proc_namespace.c
read_write.c Assorted commits that had missed the last merge window... 2024-05-21 13:11:44 -07:00
readdir.c fsnotify: optionally pass access range in file permission hooks 2023-12-12 16:20:02 +01:00
remap_range.c vfs: export remap and write check helpers 2024-04-15 14:54:13 -07:00
select.c fs/select: rework stack allocation hack for clang 2024-02-20 09:23:52 +01:00
seq_file.c seq_file: Simplify __seq_puts() 2024-05-02 16:28:20 +02:00
signalfd.c signalfd: drop an obsolete comment 2024-05-24 13:34:07 +02:00
splice.c remove call_{read,write}_iter() functions 2024-04-15 16:03:25 -04:00
stack.c
stat.c statx: stx_subvol 2024-03-26 09:01:18 +01:00
statfs.c
super.c \n 2024-05-20 12:31:43 -07:00
sync.c
sysctls.c fs: Remove the now superfluous sentinel elements from ctl_table array 2023-12-28 04:57:57 -08:00
timerfd.c timerfd: convert to ->read_iter() 2024-04-10 16:23:02 -06:00
userfaultfd.c The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
utimes.c
xattr.c evm: Move to LSM infrastructure 2024-02-15 23:43:47 -05:00