iv/linux

Filipe Manana 0f8ce49821 btrfs: avoid inode logging during rename and link when possible

During a rename or link operation, we need to determine if an inode was previously logged or not, and if it was, do some update to the logged inode. We used to rely exclusively on the logged_trans field of struct btrfs_inode to determine that, but that was not reliable because the value of that field is not persisted in the inode item, so it's lost when an inode is evicted and loaded back again. That led to several issues in the past, such as not persisting deletions (such as the case fixed by commit `803f0f64d1` ("Btrfs: fix fsync not persisting dentry deletions due to inode evictions")), or resulting in losing a file after an inode eviction followed by a rename (commit `ecc64fab7d` ("btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction")), besides other issues.

So the inode_logged() helper was introduced and used to determine if an inode was possibly logged before in the current transaction, with the caveat that it could return false positives, in the sense that even if an inode was not logged before in the current transaction, it could still return true, but never to return false in case the inode was logged.

From a functional point of view that is fine, but from a performance perspective it can introduce significant latencies to rename and link operations, as they will end up doing inode logging even when it is not necessary.

Recently on a 5.15 kernel, an openSUSE Tumbleweed user reported package installations and upgrades, with the zypper tool, were often taking a long time to complete. With strace it could be observed that zypper was spending about 99% of its time on rename operations, and then with further analysis we checked that directory logging was happening too frequently. Taking into account that installation/upgrade of some of the packages needed a few thousand file renames, the slowdown was very noticeable for the user.

The issue was caused indirectly due to an excessive number of inode evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14 or a 5.16-rc8 kernel. While triggering the inode evictions is something outside btrfs' control, btrfs could still behave better by eliminating the false positives from the inode_logged() helper.

So change inode_logged() to actually eliminate such false positives caused by inode eviction and when an inode was never logged since the filesystem was mounted, as both cases relate to when the logged_trans field of struct btrfs_inode has a value of zero. When it can not determine if the inode was logged based only on the logged_trans value, look up the existence of the inode item in the log tree: if it's there then we know the inode was logged, if it's not there then it can not have been logged in the current transaction. Once we determine if the inode was logged, update the logged_trans value to avoid future calls to have to search in the log tree again.

Alternatively, we could start storing logged_trans in the on disk inode item structure (struct btrfs_inode_item) in the unused space it still has, but that would be a bit odd because:

1) We only care about logged_trans since the filesystem was mounted, we don't care about its value from a previous mount. Having it persisted in the inode item structure would not make the best use of the precious unused space;

2) In order to get logged_trans persisted before inode eviction, we would have to update the delayed inode when we finish logging the inode and update its logged_trans in struct btrfs_inode, which makes it a bit cumbersome since we need to check if the delayed inode exists, if not create it and populate it and deal with any errors (-ENOMEM mostly).

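The decision logic described for inode_logged() can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the kernel implementation: `struct model_inode`, `struct model_log_tree`, `log_tree_has_inode_item()` and `model_inode_logged()` are hypothetical names, the log tree is modeled as a flat array of inode numbers, and storing `transid - 1` to remember a "not logged" answer is an assumption, since the commit message only says that logged_trans is updated once the answer is known.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the in-memory inode (not struct btrfs_inode). */
struct model_inode {
	uint64_t ino;
	uint64_t logged_trans; /* 0 = unknown: never logged since mount, or evicted */
};

/* Hypothetical stand-in for a log tree: the inode numbers that have an inode item. */
struct model_log_tree {
	const uint64_t *logged_inos;
	size_t count;
	unsigned long searches; /* counts lookups, to show the caching effect */
};

/* Models the (comparatively expensive) search for an inode item in the log tree. */
static bool log_tree_has_inode_item(struct model_log_tree *log, uint64_t ino)
{
	log->searches++;
	for (size_t i = 0; i < log->count; i++)
		if (log->logged_inos[i] == ino)
			return true;
	return false;
}

/*
 * Returns true only if the inode was logged in transaction 'transid'.
 * A NULL 'log' models "no log tree exists in this transaction".
 */
static bool model_inode_logged(struct model_inode *inode,
			       struct model_log_tree *log, uint64_t transid)
{
	/* Known to be logged in the current transaction. */
	if (inode->logged_trans == transid)
		return true;

	/* Non-zero but different: known not to be logged in this transaction. */
	if (inode->logged_trans != 0)
		return false;

	/* Zero means unknown, so the log tree is the only reliable source. */
	if (log == NULL || !log_tree_has_inode_item(log, inode->ino)) {
		/* Assumption: remember the negative answer as an older transaction. */
		inode->logged_trans = transid - 1;
		return false;
	}

	/* Found in the log tree: cache the positive answer. */
	inode->logged_trans = transid;
	return true;
}
```

In this model, a second call for the same inode in the same transaction is answered from logged_trans alone, which is the effect the commit message describes as avoiding repeated log tree searches.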
This change is part of a patchset comprised of the following patches:

  1/5 btrfs: add helper to delete a dir entry from a log tree
  2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
  3/5 btrfs: avoid logging all directory changes during renames
  4/5 btrfs: stop doing unnecessary log updates during a rename
  5/5 btrfs: avoid inode logging during rename and link when possible

The following test script mimics part of what the zypper tool does during package installations/upgrades. It does not trigger inode evictions, but it's similar because it triggers false positives from the inode_logged() helper, because the inodes have a logged_trans of 0, there's a log tree due to a fsync of an unrelated file and the directory inode has its last_trans field set to the current transaction:

  $ cat test.sh

  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/nvme0n1
  NUM_FILES=10000

  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  mkdir $MNT/testdir

  for ((i = 1; i <= $NUM_FILES; i++)); do
      echo -n > $MNT/testdir/file_$i
  done

  sync

  # Now do some change to an unrelated file and fsync it.
  # This is just to create a log tree to make sure that inode_logged()
  # does not return false when called against "testdir".
  xfs_io -f -c "pwrite 0 4K" -c "fsync" $MNT/foo

  # Do some change to testdir. This is to make sure inode_logged()
  # will return true when called against "testdir", because its
  # logged_trans is 0, it was changed in the current transaction
  # and there's a log tree.
  echo -n > $MNT/testdir/file_$((NUM_FILES + 1))

  echo "Renaming $NUM_FILES files..."
  start=$(date +%s%N)
  for ((i = 1; i <= $NUM_FILES; i++)); do
      mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
  done
  end=$(date +%s%N)

  dur=$(( (end - start) / 1000000 ))
  echo "Renames took $dur milliseconds"

  umount $MNT

Testing this change on a box using a non-debug kernel (Debian's default kernel config) gave the following results:

  NUM_FILES=10000, before patchset:                   27837 ms
  NUM_FILES=10000, after patches 1/5 to 4/5 applied:   9236 ms (-66.8%)
  NUM_FILES=10000, after whole patchset applied:       8902 ms (-68.0%)

  NUM_FILES=5000, before patchset:                     9127 ms
  NUM_FILES=5000, after patches 1/5 to 4/5 applied:    4640 ms (-49.2%)
  NUM_FILES=5000, after whole patchset applied:        4441 ms (-51.3%)

  NUM_FILES=2000, before patchset:                     2528 ms
  NUM_FILES=2000, after patches 1/5 to 4/5 applied:    1983 ms (-21.6%)
  NUM_FILES=2000, after whole patchset applied:        1747 ms (-30.9%)

  NUM_FILES=1000, before patchset:                     1085 ms
  NUM_FILES=1000, after patches 1/5 to 4/5 applied:     893 ms (-17.7%)
  NUM_FILES=1000, after whole patchset applied:         867 ms (-20.1%)

Running dbench on the same physical machine with the following script:

  $ cat run-dbench.sh

  #!/bin/bash

  NUM_JOBS=$(nproc --all)

  DEV=/dev/nvme0n1
  MNT=/mnt/nvme0n1
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-O no-holes -R free-space-tree"

  echo "performance" | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 120 $NUM_JOBS

  umount $MNT

Before patchset:

  Operation      Count    AvgLat    MaxLat
  ----------------------------------------
  NTCreateX    3761352     0.032   143.843
  Close        2762770     0.002     2.273
  Rename        159304     0.291    67.037
  Unlink        759784     0.207   143.998
  Deltree           72     4.028    15.977
  Mkdir             36     0.003     0.006
  Qpathinfo    3409780     0.013     9.678
  Qfileinfo     596772     0.001     0.878
  Qfsinfo       625189     0.003     1.245
  Sfileinfo     306443     0.006     1.840
  Find         1318106     0.063    19.798
  WriteX       1871137     0.021     8.532
  ReadX        5897325     0.003     3.567
  LockX          12252     0.003     0.258
  UnlockX        12252     0.002     0.100
  Flush         263666     3.327   155.632

  Throughput 980.047 MB/sec  12 clients  12 procs  max_latency=155.636 ms

After whole patchset applied:

  Operation      Count    AvgLat    MaxLat
  ----------------------------------------
  NTCreateX    4195584     0.033   107.742
  Close        3081932     0.002     1.935
  Rename        177641     0.218    14.905
  Unlink        847333     0.166   107.822
  Deltree          118     5.315    15.247
  Mkdir             59     0.004     0.048
  Qpathinfo    3802612     0.014    10.302
  Qfileinfo     666748     0.001     1.034
  Qfsinfo       697329     0.003     0.944
  Sfileinfo     341712     0.006     2.099
  Find         1470365     0.065     9.359
  WriteX       2093921     0.021     8.087
  ReadX        6576234     0.003     3.407
  LockX          13660     0.003     0.308
  UnlockX        13660     0.002     0.114
  Flush         294090     2.906   115.539

  Throughput 1093.11 MB/sec  12 clients  12 procs  max_latency=115.544 ms

  +11.5% throughput
  -25.8% max latency
  rename max latency -77.8%

Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2022-03-14 13:13:48 +01:00
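The percentages quoted in the commit message are relative changes against the "before patchset" figure. A minimal sketch of that arithmetic follows; `pct_change()` is a hypothetical helper written for this illustration and is not part of the commit or of dbench.

```c
/*
 * Relative change of 'after' against 'before', in percent.
 * Negative means a reduction (for example lower latency or duration).
 */
static double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

For instance, `pct_change(27837, 9236)` corresponds to the 10000-file rename run going from 27837 ms to 9236 ms, and `pct_change(980.047, 1093.11)` to the dbench throughput change.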
..
9p	Revert "fs/9p: search open fids first"	2022-01-30 22:13:37 +09:00
adfs	fs/adfs: remove unneeded variable make code cleaner	2022-01-20 08:52:55 +02:00
affs	affs: use bdev_nr_sectors instead of open coding it	2021-10-18 14:43:22 -06:00
afs	afs: Fix potential thrashing in afs writeback	2022-03-11 10:24:37 -08:00
autofs	autofs: fix wait name hash calculation in autofs_wait()	2021-10-20 21:09:02 -04:00
befs	isystem: ship and use stdarg.h	2021-08-19 09:02:55 +09:00
bfs	mm: require ->set_page_dirty to be explicitly wired up	2021-06-29 10:53:48 -07:00
btrfs	btrfs: avoid inode logging during rename and link when possible	2022-03-14 13:13:48 +01:00
cachefiles	cachefiles: Fix volume coherency attribute	2022-03-11 10:24:37 -08:00
ceph	ceph: set pool_ns in new inode layout for async creates	2022-01-26 20:17:50 +01:00
cifs	cifs: fix confusing unneeded warning message on smb2.1 and earlier	2022-02-16 17:16:49 -06:00
coda	coda: bump module version to 7.2	2021-11-09 10:02:51 -08:00
configfs	configfs: fix a race in configfs_{,un}register_subsystem()	2022-02-22 18:30:28 +01:00
cramfs	cramfs: use bdev_nr_bytes instead of open coding it	2021-10-18 14:43:22 -06:00
crypto	fscrypt: improve a few comments	2021-10-25 19:11:50 -07:00
debugfs	debugfs: lockdown: Allow reading debugfs files that are not world readable	2022-01-06 15:47:41 +01:00
devpts	fsnotify: fix fsnotify hooks in pseudo filesystems	2022-01-24 14:17:02 +01:00
dlm	driver core changes for 5.17-rc1	2022-01-12 11:11:34 -08:00
ecryptfs	fs: add is_idmapped_mnt() helper	2021-12-03 18:44:06 +01:00
efivarfs
efs
erofs	erofs: fix ztailpacking on > 4GiB filesystems	2022-03-02 21:58:45 +08:00
exfat	exfat: fix missing REQ_SYNC in exfat_update_bhs()	2022-01-10 11:00:04 +09:00
exportfs
ext2	fsdax: shift partition offset handling into the file systems	2021-12-04 08:58:54 -08:00
ext4	Various bug fixes for ext4 fast commit and inline data handling. Also	2022-02-06 10:34:45 -08:00
f2fs	Fix from Christoph Hellwig merging the CONFIG_UNICODE_UTF8_DATA into the	2022-02-01 11:13:24 -08:00
fat	FAT: use io_schedule_timeout() instead of congestion_wait()	2022-01-20 08:52:54 +02:00
freevxfs
fscache	fscache: Fix the volume collision wait condition	2022-01-21 21:36:28 +00:00
fuse	fuse: fix pipe buffer lifetime for direct_io	2022-03-07 16:30:44 +01:00
gfs2	gfs2 fixes:	2022-02-11 11:36:32 -08:00
hfs	Merge branch 'akpm' (patches from Andrew)	2021-11-09 10:11:53 -08:00
hfsplus	hfsplus: use struct_group_attr() for memcpy() region	2022-01-20 08:52:54 +02:00
hostfs	hostfs: Fix writeback of dirty pages	2021-12-21 21:44:27 +01:00
hpfs	treewide: Replace open-coded flex arrays in unions	2021-10-18 12:28:53 -07:00
hugetlbfs	hugetlbfs: fix off-by-one error in hugetlb_vmdelete_list()	2022-01-15 16:30:30 +02:00
iomap	xfs, iomap: limit individual ioend chain lengths in writeback	2022-01-26 09:19:20 -08:00
isofs	isofs: Fix out of bound access for corrupted isofs image	2021-10-19 12:51:02 +02:00
jbd2	Various bug fixes for ext4 fast commit and inline data handling. Also	2022-02-06 10:34:45 -08:00
jffs2	Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2022-01-17 05:49:30 +02:00
jfs	Just one JFS patch	2021-11-03 09:23:25 -07:00
kernfs	kernfs: prevent early freeing of root node	2021-12-03 14:36:21 +01:00
ksmbd	ksmbd: add support for key exchange	2022-02-04 00:12:22 -06:00
lockd	Notable bug fixes:	2022-02-02 10:14:31 -08:00
minix	mm: require ->set_page_dirty to be explicitly wired up	2021-06-29 10:53:48 -07:00
netfs	netfs: Make ops->init_rreq() optional	2022-01-21 21:36:28 +00:00
nfs	NFS: Do not report writeback errors in nfs_getattr()	2022-02-16 15:15:22 -05:00
nfs_common	nfs: Fix kerneldoc warning shown up by W=1	2021-10-04 22:02:17 +01:00
nfsd	Notable bug fixes:	2022-02-09 09:56:57 -08:00
nilfs2	Merge branch 'akpm' (patches from Andrew)	2022-01-20 10:41:01 +02:00
nls
notify	fanotify: Fix stale file descriptor in copy_event_to_user()	2022-02-01 12:52:07 +01:00
ntfs	fs/ntfs/attrib.c: fix one kernel-doc comment	2022-01-15 16:30:24 +02:00
ntfs3	mm: remove cleancache	2022-01-22 08:33:38 +02:00
ocfs2	ocfs2: fix a deadlock when commit trans	2022-01-30 09:56:58 +02:00
omfs	mm: require ->set_page_dirty to be explicitly wired up	2021-06-29 10:53:48 -07:00
openpromfs
orangefs	orangefs: Fix the size of a memory allocation in orangefs_bufmap_alloc()	2021-12-31 14:37:43 -05:00
overlayfs	overlayfs fixes for 5.17-rc3	2022-02-01 11:23:02 -08:00
proc	proc: fix documentation and description of pagemap	2022-03-05 11:08:33 -08:00
pstore	pstore update for v5.17-rc1	2022-01-10 11:48:37 -08:00
qnx4	qnx4: work around gcc false positive warning bug	2021-09-21 08:36:48 -07:00
qnx6
quota	quota: make dquot_quota_sync return errors from ->sync_fs	2022-01-30 08:59:47 -08:00
ramfs	Merge branch 'akpm' (patches from Andrew)	2021-11-09 10:11:53 -08:00
reiserfs	reiserfs: don't use congestion_wait()	2021-11-18 11:52:22 +01:00
romfs
smbfs_common	smb3: add new defines from protocol specification	2022-01-18 16:50:47 -06:00
squashfs	squashfs: provide backing_dev_info in order to disable read-ahead	2022-01-15 16:30:24 +02:00
sysfs	fs/sysfs/dir.c: replace S_IRWXU|S_IRUGO|S_IXUGO with 0755 sysfs_create_dir_ns()	2021-10-05 16:35:05 +02:00
sysv	sysv: use BUILD_BUG_ON instead of runtime check	2021-11-09 10:02:52 -08:00
tracefs	tracefs: Set the group ownership in apply_options() not parse_options()	2022-02-25 21:05:04 -05:00
ubifs	ubifs: read-only if LEB may always be taken in ubifs_garbage_collect	2021-12-23 22:30:38 +01:00
udf	udf: Restore i_lenAlloc when inode expansion fails	2022-01-24 14:45:02 +01:00
ufs	isystem: ship and use stdarg.h	2021-08-19 09:02:55 +09:00
unicode	Fix from Christoph Hellwig merging the CONFIG_UNICODE_UTF8_DATA into the	2022-02-01 11:13:24 -08:00
vboxsf	vboxfs: fix broken legacy mount signature checking	2021-09-27 11:26:21 -07:00
verity	fs-verity: fix signed integer overflow with i_size near S64_MAX	2021-09-22 10:56:34 -07:00
xfs	Bug fixes for 5.17-rc4:	2022-02-26 09:53:19 -08:00
zonefs	zonefs: add MODULE_ALIAS_FS	2021-12-17 16:56:35 +09:00
aio.c	aio: move aio sysctl to aio.c	2022-01-22 08:33:34 +02:00
anon_inodes.c	fs: add anon_inode_getfile_secure() similar to anon_inode_getfd_secure()	2021-09-19 22:35:37 -04:00
attr.c	fs: handle circular mappings correctly	2021-11-17 09:26:09 +01:00
bad_inode.c	vfs: add rcu argument to ->get_acl() callback	2021-08-18 22:08:24 +02:00
binfmt_aout.c	binfmt: a.out: Fix bogus semicolon	2021-09-05 10:15:05 -07:00
binfmt_elf_fdpic.c	coredump: Limit coredumps to a single thread group	2021-10-08 12:06:02 -05:00
binfmt_elf.c	binfmt_elf fix for v5.17-rc7	2022-03-01 11:31:37 -08:00
binfmt_flat.c	binfmt: remove in-tree usage of MAP_EXECUTABLE	2021-06-29 10:53:50 -07:00
binfmt_misc.c	Fix regression due to "fs: move binfmt_misc sysctl to its own file"	2022-02-09 09:50:02 -08:00
binfmt_script.c
buffer.c	fs/buffer: Convert __block_write_begin_int() to take a folio	2021-12-16 15:49:51 -05:00
char_dev.c
compat_binfmt_elf.c
coredump.c	fs/coredump: move coredump sysctls into its own file	2022-01-22 08:33:36 +02:00
d_path.c	d_path: fix Kernel doc validator complaining	2021-11-06 13:30:32 -07:00
dax.c	dax: remove the copy_from_iter and copy_to_iter methods	2021-12-18 08:04:53 -08:00
dcache.c	fs: move dcache sysctls to its own file	2022-01-22 08:33:36 +02:00
direct-io.c	fs: get rid of the res2 iocb->ki_complete argument	2021-10-25 10:36:24 -06:00
drop_caches.c	fs: drop_caches: fix skipping over shadow cache inodes	2021-09-03 09:58:10 -07:00
eventfd.c	eventfd: Export eventfd_wake_count to modules	2021-09-06 07:20:56 -04:00
eventpoll.c	eventpoll: simplify sysctl declaration with register_sysctl()	2022-01-22 08:33:35 +02:00
exec.c	fs/coredump: move coredump sysctls into its own file	2022-01-22 08:33:36 +02:00
fcntl.c	Merge branch 'akpm' (patches from Andrew)	2021-09-03 10:08:28 -07:00
fhandle.c
file_table.c	fs/file_table: fix adding missing kmemleak_not_leak()	2022-02-17 10:23:19 -08:00
file.c	fget: clarify and improve __fget_files() implementation	2021-12-13 10:55:30 -08:00
filesystems.c	fs: simplify get_filesystem_list / get_all_fs_names	2021-08-23 01:25:40 -04:00
fs_context.c	vfs: fs_context: fix up param length parsing in legacy_parse_param	2022-01-18 09:23:19 +02:00
fs_parser.c	fs_parse: allow parameter value to be empty	2021-12-09 14:09:36 -05:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c	fscache rewrite	2022-01-12 13:45:12 -08:00
fsopen.c
init.c
inode.c	fs: move inode sysctls to its own file	2022-01-22 08:33:35 +02:00
internal.h	fs/buffer: Convert __block_write_begin_int() to take a folio	2021-12-16 15:49:51 -05:00
io_uring.c	io_uring: disallow modification of rsrc_data during quiesce	2022-02-22 09:57:32 -07:00
io-wq.c	io_uring-5.17-2022-01-21	2022-01-21 16:07:21 +02:00
io-wq.h	Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2022-01-17 05:49:30 +02:00
ioctl.c	fs/ioctl: remove unnecessary __user annotation	2022-01-15 16:30:25 +02:00
Kconfig	ksmbd: add support for key exchange	2022-02-04 00:12:22 -06:00
Kconfig.binfmt	binfmt: remove support for em86 (alpha only)	2021-07-25 22:33:03 -07:00
kernel_read_file.c	vfs: check fd has read access in kernel_read_file_from_fd()	2021-10-18 20:22:03 -10:00
libfs.c	unicode: clean up the Kconfig symbol confusion	2022-01-20 19:57:24 -05:00
locks.c	fs: move locking sysctls where they are used	2022-01-22 08:33:36 +02:00
Makefile	Fix from Christoph Hellwig merging the CONFIG_UNICODE_UTF8_DATA into the	2022-02-01 11:13:24 -08:00
mbcache.c
mount.h
mpage.c	mm: remove cleancache	2022-01-22 08:33:38 +02:00
namei.c	\n	2022-01-28 17:51:31 +02:00
namespace.c	fs: add kernel doc for mnt_{hold,unhold}_writers()	2022-02-14 08:35:32 +01:00
no-block.c
nsfs.c
open.c	fs: support mapped mounts of mapped filesystems	2021-12-05 10:28:57 +01:00
pipe.c	watch_queue: Fix lack of barrier/sync/lock between post and read	2022-03-11 10:17:13 -08:00
pnode.c
pnode.h
posix_acl.c	fs: support mapped mounts of mapped filesystems	2021-12-05 10:28:57 +01:00
proc_namespace.c	fs: add is_idmapped_mnt() helper	2021-12-03 18:44:06 +01:00
read_write.c	fs: remove leftover comments from mandatory locking removal	2021-10-26 12:20:50 -04:00
readdir.c
remap_range.c	fs: Convert vfs_dedupe_file_range_compare to folios	2022-01-08 00:28:41 -05:00
select.c	select: Fix indefinitely sleeping task in poll_schedule_timeout()	2022-01-11 09:03:05 -08:00
seq_file.c	seq_file: move seq_escape() to a header	2021-11-09 10:02:52 -08:00
signalfd.c	Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2022-01-17 05:49:30 +02:00
splice.c
stack.c
stat.c	fs: add generic helper for filling statx attribute flags	2021-08-17 11:47:43 +02:00
statfs.c
super.c	vfs: make freeze_super abort when sync_filesystem returns error	2022-01-30 08:59:47 -08:00
sync.c	vfs: make sync_filesystem return errors from ->sync_fs	2022-01-30 08:59:47 -08:00
sysctls.c	fs: move namespace sysctls and declare fs base directory	2022-01-22 08:33:36 +02:00
timerfd.c	timerfd: Provide timerfd_resume()	2021-08-10 17:57:22 +02:00
userfaultfd.c	mm: refactor vm_area_struct::anon_vma_name usage code	2022-03-05 11:08:32 -08:00
utimes.c
xattr.c