linux

iv/linux

Filipe Manana 2766ff6176 btrfs: update the number of bytes used by an inode atomically

There are several occasions where we do not update the inode's number of used bytes atomically, resulting in a concurrent stat(2) syscall to report a value of used blocks that does not correspond to a valid value, that is, a value that matches neither what we had before the operation nor what we get after the operation completes.

In extreme cases it can result in stat(2) reporting zero used blocks, which can cause problems for some userspace tools where they can consider a file with a non-zero size and zero used blocks as completely sparse and skip reading data, as reported/discussed a long time ago in some threads like the following:

  https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html

The cases where this can happen are the following:

-> Case 1

If we do a write (buffered or direct IO) against a file region for which there is already an allocated extent (or multiple extents), then we have a short time window where we can report a number of used blocks to stat(2) that does not take into account the file region being overwritten. This short time window happens when completing the ordered extent(s).

This happens because when we drop the extents in the write range we decrement the inode's number of bytes and later on when we insert the new extent(s) we increment the number of bytes in the inode, resulting in a short time window where a stat(2) syscall can get an incorrect number of used blocks.

If we do writes that overwrite an entire file, then we have a short time window where we report 0 used blocks to stat(2).

Example reproducer:

  $ cat reproducer-1.sh
  #!/bin/bash

  MNT=/mnt/sdi
  DEV=/dev/sdi

  stat_loop()
  {
      trap "wait; exit" SIGTERM
      local filepath=$1
      local expected=$2
      local got

      while :; do
          got=$(stat -c %b $filepath)
          if [ $got -ne $expected ]; then
              echo -n "ERROR: unexpected used blocks"
              echo " (got: $got expected: $expected)"
          fi
      done
  }

  mkfs.btrfs -f $DEV > /dev/null
  # mkfs.xfs -f $DEV > /dev/null
  # mkfs.ext4 -F $DEV > /dev/null
  # mkfs.f2fs -f $DEV > /dev/null
  # mkfs.reiserfs -f $DEV > /dev/null
  mount $DEV $MNT

  xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
  expected=$(stat -c %b $MNT/foobar)

  # Create a process to keep calling stat(2) on the file and see if the
  # reported number of blocks used (disk space used) changes, it should
  # not because we are not increasing the file size nor punching holes.
  stat_loop $MNT/foobar $expected &
  loop_pid=$!

  for ((i = 0; i < 50000; i++)); do
      xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
  done

  kill $loop_pid &> /dev/null
  wait

  umount $DEV

  $ ./reproducer-1.sh
  ERROR: unexpected used blocks (got: 0 expected: 128)
  ERROR: unexpected used blocks (got: 0 expected: 128)
  (...)

Note that since this is a short time window where the race can happen, the reproducer may not be able to always trigger the bug in one run, or it may trigger it multiple times.

-> Case 2

If we do a buffered write against a file region that does not have any allocated extents, like a hole or beyond EOF, then during ordered extent completion we have a short time window where a concurrent stat(2) syscall can report a number of used blocks that does not correspond to the value before or after the write operation, a value that is actually larger than the value after the write completes.

This happens because once we start a buffered write into an unallocated file range we increment the inode's 'new_delalloc_bytes', to make sure any stat(2) call gets a correct used blocks value before delalloc is flushed and completes.

However at ordered extent completion, after we inserted the new extent, we increment the inode's number of bytes used with the size of the new extent, and only later, when clearing the range in the inode's iotree, we decrement the inode's 'new_delalloc_bytes' counter with the size of the extent. So this results in a short time window where a concurrent stat(2) syscall can report a number of used blocks that accounts for the new extent twice.

Example reproducer:

  $ cat reproducer-2.sh
  #!/bin/bash

  MNT=/mnt/sdi
  DEV=/dev/sdi

  stat_loop()
  {
      trap "wait; exit" SIGTERM
      local filepath=$1
      local expected=$2
      local got

      while :; do
          got=$(stat -c %b $filepath)
          if [ $got -ne $expected ]; then
              echo -n "ERROR: unexpected used blocks"
              echo " (got: $got expected: $expected)"
          fi
      done
  }

  mkfs.btrfs -f $DEV > /dev/null
  # mkfs.xfs -f $DEV > /dev/null
  # mkfs.ext4 -F $DEV > /dev/null
  # mkfs.f2fs -f $DEV > /dev/null
  # mkfs.reiserfs -f $DEV > /dev/null
  mount $DEV $MNT

  touch $MNT/foobar
  write_size=$((64 * 1024))
  for ((i = 0; i < 16384; i++)); do
      offset=$(($i * $write_size))
      xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
      blocks_used=$(stat -c %b $MNT/foobar)

      # Fsync the file to trigger writeback and keep calling stat(2) on it
      # to see if the number of blocks used changes.
      stat_loop $MNT/foobar $blocks_used &
      loop_pid=$!
      xfs_io -c "fsync" $MNT/foobar

      kill $loop_pid &> /dev/null
      wait $loop_pid
  done

  umount $DEV

  $ ./reproducer-2.sh
  ERROR: unexpected used blocks (got: 265472 expected: 265344)
  ERROR: unexpected used blocks (got: 284032 expected: 283904)
  (...)

Note that since this is a short time window where the race can happen, the reproducer may not be able to always trigger the bug in one run, or it may trigger it multiple times.

-> Case 3

Another case where such problems happen is during other operations that replace extents in a file range with other extents. Those operations are extent cloning, deduplication and fallocate's zero range operation.

The cause of the problem is similar to the first case. When we drop the extents from a range, we decrement the inode's number of bytes, and later on, after inserting the new extents we increment it. Since this is not done atomically, a concurrent stat(2) call can see and return a number of used blocks that is smaller than it should be and does not match the number of used blocks before or after the clone/deduplication/zero operation.

Like for the first case, when doing a clone, deduplication or zero range operation against an entire file, we end up having a time window where we can report 0 used blocks to a stat(2) call.

Example reproducer:

  $ cat reproducer-3.sh
  #!/bin/bash

  MNT=/mnt/sdi
  DEV=/dev/sdi

  mkfs.btrfs -f $DEV > /dev/null
  # mkfs.xfs -f -m reflink=1 $DEV > /dev/null
  mount $DEV $MNT

  extent_size=$((64 * 1024))
  num_extents=16384
  file_size=$(($extent_size * $num_extents))

  # File foo has many small extents.
  xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
      > /dev/null
  # File bar has far fewer extents and has exactly the same data as foo.
  xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null

  expected=$(stat -c %b $MNT/foo)

  # Now deduplicate bar into foo. While the deduplication is in progress,
  # the number of used blocks/file size reported by stat should not change
  xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null &
  dedupe_pid=$!
  while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
      used=$(stat -c %b $MNT/foo)
      if [ $used -ne $expected ]; then
          echo "Unexpected blocks used: $used (expected: $expected)"
      fi
  done

  umount $DEV

  $ ./reproducer-3.sh
  Unexpected blocks used: 2076800 (expected: 2097152)
  Unexpected blocks used: 2097024 (expected: 2097152)
  Unexpected blocks used: 2079872 (expected: 2097152)
  (...)

Note that since this is a short time window where the race can happen, the reproducer may not be able to always trigger the bug in one run, or it may trigger it multiple times.

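The numbers printed by the reproducers above can be checked with a little arithmetic. The sketch below is only an illustration and not part of the commit: it assumes that `stat -c %b` counts 512-byte units (consistent with a 64K file reporting 128 blocks in reproducer-1), and the helper names `bytes_to_blocks` and `extents_in_diff` are hypothetical.

```shell
#!/bin/bash

# Assumption: stat's %b reports 512-byte units.
BLOCK_UNIT=512

# Hypothetical helper: convert a byte count into %b-style blocks.
bytes_to_blocks()
{
    echo $(( $1 / BLOCK_UNIT ))
}

# Hypothetical helper: how many whole extents of EXTENT_BYTES the
# difference between two %b readings corresponds to.
extents_in_diff()
{
    local expected=$1
    local got=$2
    local extent_bytes=$3
    local per_extent=$(( extent_bytes / BLOCK_UNIT ))
    local diff=$(( expected - got ))

    # Use the absolute difference, the reading can be too high or too low.
    if [ "$diff" -lt 0 ]; then
        diff=$(( -diff ))
    fi
    echo $(( diff / per_extent ))
}

# One 64K extent, as written by reproducer-1.
bytes_to_blocks $((64 * 1024))                  # prints 128

# 16384 extents of 64K, as written by reproducer-3.
bytes_to_blocks $((64 * 1024 * 16384))          # prints 2097152

# Case 2 sample: the reading is one 64K extent too high.
extents_in_diff 265344 265472 $((64 * 1024))    # prints 1

# Case 3 sample: the reading is 159 extents too low.
extents_in_diff 2097152 2076800 $((64 * 1024))  # prints 159
```

Under that assumption, the Case 2 samples are each off by exactly one 64K extent, which fits the explanation that the new extent is briefly accounted for twice, and the Case 3 samples are short by a whole number of 64K extents, which fits extents having been dropped but not yet re-added.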
So fix this by:

1) Making btrfs_drop_extents() not decrement the VFS inode's number of bytes, and instead return the number of bytes;

2) Making any code that drops extents and adds new extents update the inode's number of bytes atomically, while holding the btrfs inode's spinlock, which is also used by the stat(2) callback to get the inode's number of bytes;

3) For ranges in the inode's iotree that are marked as 'delalloc new', corresponding to previously unallocated ranges, increment the inode's number of bytes when clearing the 'delalloc new' bit from the range, in the same critical section that decrements the inode's 'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.

An alternative would be to have btrfs_getattr() wait for any IO (ordered extents in progress) and lock the whole range (0 to (u64)-1) while it computes the number of blocks used. But that would mean blocking stat(2), which is a heavily used syscall and expected to be fast, waiting for writes, clone/dedupe, fallocate, page reads, fiemap, etc.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2020-12-08 15:54:08 +01:00
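To see why a transient reading of zero used blocks matters, here is a minimal sketch of the kind of check the commit message attributes to some userspace tools: a file with a non-zero size and zero used blocks is taken to be completely sparse. The function name `looks_fully_sparse` is hypothetical, and this is not the code of any particular tool.

```shell
#!/bin/bash

# Hypothetical helper: succeeds when a file with the given size (bytes)
# and used block count would be considered completely sparse.
looks_fully_sparse()
{
    local size=$1
    local blocks=$2

    [ "$size" -gt 0 ] && [ "$blocks" -eq 0 ]
}

# Values of the kind seen in reproducer-1: a 64K file that normally
# reports 128 blocks but briefly reports 0 during the race.
if looks_fully_sparse 65536 0; then
    echo "size=65536 blocks=0: treated as fully sparse"
fi
if ! looks_fully_sparse 65536 128; then
    echo "size=65536 blocks=128: treated as having data"
fi
```

On a real file the two inputs could be taken from `stat -c '%s %b' FILE`. A tool applying such a check during the race window could skip reading data that is actually present, which is the problem described in the linked bug-tar thread.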
..
9p	fs: 9p: add generic splice_write file operation	2020-12-01 21:40:47 +01:00
adfs	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-24 12:26:05 -07:00
affs	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-24 12:26:05 -07:00
afs	afs: Fix speculative status fetch going out of order wrt to modifications	2020-11-22 11:27:03 -08:00
autofs	autofs: harden ioctl table	2020-10-16 11:11:22 -07:00
befs	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
bfs	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
btrfs	btrfs: update the number of bytes used by an inode atomically	2020-12-08 15:54:08 +01:00
cachefiles	cachefiles: Handle readpage error correctly	2020-10-26 10:42:54 -07:00
ceph	ceph: check session state after bumping session->s_seq	2020-11-04 20:55:49 +01:00
cifs	cifs: refactor create_sd_buf() and and avoid corrupting the buffer	2020-12-03 17:12:14 -06:00
coda	docs: filesystems: convert coda.txt to ReST	2020-05-05 09:22:21 -06:00
configfs	fs: configfs: delete repeated words in comments	2020-10-16 11:11:19 -07:00
cramfs	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
crypto	fscrypt: fix inline encryption not used on new files	2020-11-11 20:59:07 -08:00
debugfs	debugfs: remove return value of debugfs_create_devm_seqfile()	2020-10-30 08:37:39 +01:00
devpts
dlm	networking changes for the 5.10 merge window	2020-10-15 18:42:13 -07:00
ecryptfs	mm, treewide: rename kzfree() to kfree_sensitive()	2020-08-07 11:33:22 -07:00
efivarfs	efivarfs: revert "fix memory leak in efivarfs_create()"	2020-11-25 16:55:02 +01:00
efs	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
erofs	erofs: fix setting up pcluster for temporary pages	2020-11-04 09:15:48 +08:00
exfat	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-24 12:26:05 -07:00
exportfs
ext2	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-24 12:26:05 -07:00
ext4	ext4: fix bogus warning in ext4_update_dx_flag()	2020-11-19 22:41:10 -05:00
f2fs	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-24 12:26:05 -07:00
fat	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
freevxfs
fscache	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next	2020-06-03 16:27:18 -07:00
fuse	fuse update for 5.10	2020-10-19 14:28:30 -07:00
gfs2	gfs2: Fix deadlock between gfs2_{create_inode,inode_lookup} and delete_work_func	2020-12-01 00:21:10 +01:00
hfs	fs: Replace zero-length array with flexible-array member	2020-10-29 17:22:59 -05:00
hfsplus	fs: Replace zero-length array with flexible-array member	2020-10-29 17:22:59 -05:00
hostfs	hostfs: Use kasprintf() instead of fixed buffer formatting	2020-03-29 23:23:00 +02:00
hpfs	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
hugetlbfs	hugetlbfs: prevent filesystem stacking of hugetlbfs	2020-08-12 10:57:56 -07:00
iomap	iomap: clean up writeback state logic on writepage error	2020-11-04 08:52:46 -08:00
isofs	fs: Replace zero-length array with flexible-array member	2020-10-29 17:22:59 -05:00
jbd2	jbd2: fix kernel-doc markups	2020-11-19 22:38:29 -05:00
jffs2	treewide: Use fallthrough pseudo-keyword	2020-08-23 17:36:59 -05:00
jfs	fs: Introduce i_blocks_per_page	2020-09-21 08:59:26 -07:00
kernfs	fsnotify: pass dir and inode arguments to fsnotify()	2020-07-27 23:15:48 +02:00
lockd	The one new feature this time, from Anna Schumaker, is READ_PLUS, which	2020-10-22 09:44:27 -07:00
minix	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
nfs	NFS: Remove unnecessary inode lock in nfs_fsync_dir()	2020-11-12 10:41:26 -05:00
nfs_common	NFSv4.2: Fix NFS4ERR_STALE error when doing inter server copy	2020-10-21 10:31:20 -04:00
nfsd	NFSD: fix missing refcount in nfsd4_copy by nfsd4_do_async_copy	2020-11-05 17:25:14 -05:00
nilfs2	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-24 12:26:05 -07:00
nls	treewide: replace '---help---' in Kconfig files with 'help'	2020-06-14 01:57:21 +09:00
notify	fanotify: fix logic of reporting name info with watched parent	2020-11-09 15:03:08 +01:00
ntfs	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-24 12:26:05 -07:00
ocfs2	ocfs2: initialize ip_next_orphan	2020-11-14 11:26:04 -08:00
omfs	fs: omfs: use kmemdup() rather than kmalloc+memcpy	2020-09-22 23:39:45 -04:00
openpromfs
orangefs	orangefs: remove unnecessary assignment to variable ret	2020-08-04 15:01:58 -04:00
overlayfs	ovl: use generic vfs_ioc_setflags_prepare() helper	2020-10-06 15:38:15 +02:00
proc	io_uring-5.10-2020-11-20	2020-11-20 11:47:22 -08:00
pstore	treewide: Use fallthrough pseudo-keyword	2020-08-23 17:36:59 -05:00
qnx4	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
qnx6	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
quota	\n	2020-10-15 14:56:15 -07:00
ramfs	ramfs: fix nommu mmap with gaps in the page cache	2020-10-16 11:11:22 -07:00
reiserfs	reiserfs: Fix oops during mount	2020-10-01 11:15:31 +02:00
romfs	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-24 12:26:05 -07:00
squashfs	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-24 12:26:05 -07:00
sysfs	sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs output	2020-10-02 12:02:30 +02:00
sysv	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
tracefs
ubifs	This pull request contains fixes for UBI and UBIFS	2020-10-18 09:56:50 -07:00
udf	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-24 12:26:05 -07:00
ufs	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-24 12:26:05 -07:00
unicode	unicode: Add utf8_casefold_hash	2020-09-10 14:03:31 -07:00
vboxsf	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial	2020-10-15 15:11:56 -07:00
verity	fs-verity: use smp_load_acquire() for ->i_verity_info	2020-07-21 16:02:41 -07:00
xfs	xfs: revert "xfs: fix rmap key and record comparison functions"	2020-11-19 15:17:50 -08:00
zonefs	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-24 12:26:05 -07:00
aio.c	vfs: separate __sb_start_write into blocking and non-blocking helpers	2020-11-10 16:53:07 -08:00
anon_inodes.c
attr.c
bad_inode.c	fs: move the fiemap definitions out of fs.h	2020-06-03 23:16:55 -04:00
binfmt_aout.c	exec: Rename flush_old_exec begin_new_exec	2020-05-07 16:55:47 -05:00
binfmt_elf_fdpic.c	binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot	2020-10-16 11:11:21 -07:00
binfmt_elf.c	fs: Replace zero-length array with flexible-array member	2020-10-29 17:22:59 -05:00
binfmt_em86.c	Merge branch 'akpm' (patches from Andrew)	2020-06-04 19:18:29 -07:00
binfmt_flat.c	binfmt_flat: revert "binfmt_flat: don't offset the data start"	2020-08-24 08:49:13 +10:00
binfmt_misc.c	Merge branch 'akpm' (patches from Andrew)	2020-06-04 19:18:29 -07:00
binfmt_script.c	Merge branch 'akpm' (patches from Andrew)	2020-06-04 19:18:29 -07:00
block_dev.c	block: add a bdget_part helper	2020-10-05 10:38:33 -06:00
buffer.c	mm, memcg: rework remote charging API to support nesting	2020-10-18 09:27:09 -07:00
char_dev.c	vfs: allow unprivileged whiteout creation	2020-05-14 16:44:23 +02:00
compat_binfmt_elf.c	Split the old READ_IMPLIES_EXEC workaround from executable PT_GNU_STACK	2020-06-05 13:45:21 -07:00
coredump.c	coredump: fix core_pattern parse error	2020-12-06 10:19:07 -08:00
d_path.c	fs: fix NULL dereference due to data race in prepend_path()	2020-10-14 14:54:45 -07:00
dax.c	fuse update for 5.10	2020-10-19 14:28:30 -07:00
dcache.c	vfs: Use sequence counter with associated spinlock	2020-07-29 16:14:27 +02:00
dcookies.c
direct-io.c	\n	2020-10-15 15:03:10 -07:00
drop_caches.c	sysctl: pass kernel pointers to ->proc_handler	2020-04-27 02:07:40 -04:00
eventfd.c	eventfd: convert to f_op->read_iter()	2020-05-06 22:33:43 -04:00
eventpoll.c	ep_create_wakeup_source(): dentry name can change under you...	2020-09-24 19:41:58 -04:00
exec.c	powerpc updates for 5.10	2020-10-16 12:21:15 -07:00
fcntl.c	treewide: Use fallthrough pseudo-keyword	2020-08-23 17:36:59 -05:00
fhandle.c
file_table.c	task_work: cleanup notification modes	2020-10-17 15:05:30 -06:00
file.c	io_uring: don't rely on weak ->files references	2020-09-30 20:32:32 -06:00
filesystems.c	fs/filesystems.c: downgrade user-reachable WARN_ONCE() to pr_warn_once()	2020-04-10 15:36:22 -07:00
fs_context.c	treewide: Use fallthrough pseudo-keyword	2020-08-23 17:36:59 -05:00
fs_parser.c	fs_parse: mark fs_param_bad_value() as static	2020-10-13 18:38:27 -07:00
fs_pin.c
fs_struct.c	vfs: Use sequence counter with associated spinlock	2020-07-29 16:14:27 +02:00
fs_types.c
fs-writeback.c	block-5.10-2020-10-12	2020-10-13 12:12:44 -07:00
fsopen.c	treewide: Use fallthrough pseudo-keyword	2020-08-23 17:36:59 -05:00
init.c	init: add an init_dup helper	2020-08-04 21:02:38 -04:00
inode.c	fs: add a filesystem flag for THPs	2020-10-16 11:11:15 -07:00
internal.h	fs: remove compat_sys_mount	2020-09-22 23:45:57 -04:00
io_uring.c	io_uring: fix recvmsg setup with compat buf-select	2020-11-30 11:12:03 -07:00
io-wq.c	io-wq: cancel request if it's asking for files and we don't have them	2020-11-04 10:22:56 -07:00
io-wq.h	io_uring: unify fsize with def->work_flags	2020-10-20 16:03:13 -06:00
ioctl.c	fs: remove ksys_ioctl	2020-07-31 08:16:01 +02:00
Kconfig	tmpfs: support 64-bit inums per-sb	2020-08-07 11:33:24 -07:00
Kconfig.binfmt	treewide: replace '---help---' in Kconfig files with 'help'	2020-06-14 01:57:21 +09:00
kernel_read_file.c	fs/kernel_file_read: Add "offset" arg for partial reads	2020-10-05 13:37:04 +02:00
libfs.c	libfs: fix error cast of negative value in simple_attr_write()	2020-11-22 10:48:22 -08:00
locks.c	treewide: Use fallthrough pseudo-keyword	2020-08-23 17:36:59 -05:00
Makefile	Refactored code for 5.10:	2020-10-23 11:33:41 -07:00
mbcache.c
mount.h	proc/mounts: add cursor	2020-05-14 16:44:24 +02:00
mpage.c	fs: convert mpage_readpages to mpage_readahead	2020-06-02 10:59:07 -07:00
namei.c	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-24 12:26:05 -07:00
namespace.c	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-24 12:26:05 -07:00
no-block.c
nsfs.c	nsproxy: attach to namespaces via pidfds	2020-05-13 11:41:22 +02:00
open.c	exec: move S_ISREG() check earlier	2020-08-12 10:58:01 -07:00
pipe.c	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-11 11:11:35 -07:00
pnode.c	propagate_one(): mnt_set_mountpoint() needs mount_lock	2020-04-27 10:37:14 -04:00
pnode.h
posix_acl.c	vfs: clean up posix_acl_permission() logic aroudn MAY_NOT_BLOCK	2020-06-08 11:04:19 -07:00
proc_namespace.c	Add a "nosymfollow" mount option.	2020-08-27 16:06:47 -04:00
read_write.c	Refactored code for 5.10:	2020-10-23 11:33:41 -07:00
readdir.c	fs: remove ksys_getdents64	2020-07-31 08:16:00 +02:00
remap_range.c	vfs: move the remap range helpers to remap_range.c	2020-10-15 09:48:49 -07:00
select.c	fs: Replace zero-length array with flexible-array member	2020-10-29 17:22:59 -05:00
seq_file.c	seq_file: add seq_read_iter	2020-11-06 10:05:18 -08:00
signalfd.c	treewide: Use fallthrough pseudo-keyword	2020-08-23 17:36:59 -05:00
splice.c	io_uring-5.10-2020-10-24	2020-10-24 12:40:18 -07:00
stack.c
stat.c	fs: remove KSTAT_QUERY_FLAGS	2020-09-26 22:55:05 -04:00
statfs.c	Add a "nosymfollow" mount option.	2020-08-27 16:06:47 -04:00
super.c	vfs: move __sb_{start,end}_write* to fs.h	2020-11-10 16:53:11 -08:00
sync.c	overlayfs update for 5.8	2020-06-09 15:40:50 -07:00
timerfd.c
userfaultfd.c	mm: remove the now-unnecessary mmget_still_valid() hack	2020-10-16 11:11:22 -07:00
utimes.c	fs: expose utimes_common	2020-07-31 08:16:01 +02:00
xattr.c	fs/xattr.c: fix kernel-doc warnings for setxattr & removexattr	2020-10-13 18:38:27 -07:00