73e339e6ab
During fiemap we process all the file extent items of an inode, in their
file offset order (left to right b+tree order), and then check if the data
extent they point at is shared or not. Until now we didn't cache those
results, we only did it for b+tree nodes/leaves, since for each unique
b+tree path we have access to hundreds of file extent items. However, it
is also common to repeat checking the sharedness of a particular data
extent in a very short time window, and the cases that lead to that are
the following:

1) COW writes.

   If we have a file extent item like this:

                  [ bytenr X, offset = 0, num_bytes = 512K ]
   file offset    0                                        512K

   Then after a 4K write into file offset 64K happens, we end up with the
   following file extent item layout:

                  [ bytenr X, offset = 0, num_bytes = 64K ]
   file offset    0                                       64K

                  [ bytenr Y, offset = 0, num_bytes = 4K ]
   file offset    64K                                    68K

                  [ bytenr X, offset = 68K, num_bytes = 444K ]
   file offset    68K                                         512K

   So during fiemap we will check for the sharedness of the data extent
   with bytenr X twice. Typically for COW writes and for at least
   moderately updated files, we end up with many file extent items that
   point to different sections of the same data extent.

2) Writing into a NOCOW file after a snapshot is taken.

   This happens if the target extent was created in a generation older
   than the generation where the last snapshot for the root (the tree the
   inode belongs to) was made. This leads to a scenario like the previous
   one.

3) Writing into sections of a preallocated extent.

   For example if a file has the following layout:

   [ bytenr X, offset = 0, num_bytes = 1M, type = prealloc ]
   0                                                       1M

   After doing a 4K write into file offset 0 and another 4K write into
   offset 512K, we get the following layout:

   [ bytenr X, offset = 0, num_bytes = 4K, type = regular ]
   0                                                      4K

   [ bytenr X, offset = 4K, num_bytes = 508K, type = prealloc ]
   4K                                                         512K

   [ bytenr X, offset = 512K, num_bytes = 4K, type = regular ]
   512K                                                      516K

   [ bytenr X, offset = 516K, num_bytes = 508K, type = prealloc ]
   516K                                                          1M

   So we end up with 4 consecutive file extent items pointing to the data
   extent at bytenr X.

4) Hole punching in the middle of an extent.

   For example if a file has the following file extent item:

   [ bytenr X, offset = 0, num_bytes = 8M ]
   0                                      8M

   And then a hole is punched for the file range [4M, 6M[, our file
   extent item gets split into two:

   [ bytenr X, offset = 0, num_bytes = 4M ]
   0                                      4M

   [ 2M hole, implicit or explicit depending on NO_HOLES feature ]
   4M                                                            6M

   [ bytenr X, offset = 6M, num_bytes = 2M ]
   6M                                       8M

   Again, we end up with two file extent items pointing to the same data
   extent.

5) When reflinking (clone and deduplication) within the same file.
   This is probably the least common case of all.

In cases 1, 2, 3 and 4, when we have multiple file extent items that
point to the same data extent, their distance is usually short, typically
separated by a few slots in a b+tree leaf (or across sibling leaves). For
case 5, the distance can vary a lot, but it's typically the least common
case.

This change caches the result of the sharedness checks for data extents,
but only for the last 8 extents that we notice our inode refers to with
multiple file extent items. Whenever we want to check if a data extent is
shared, we look up the cache, which consists of doing a linear scan of an
8 element array, and if we find the data extent there, we return the
result and don't check the extent tree and delayed refs.
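To make the scheme concrete, below is a minimal user space sketch of such
a cache in C. All names (prev_extents_cache, cache_lookup, cache_store)
are hypothetical and the layout is simplified relative to the kernel's
backref code; as described above, the kernel only populates this cache
for extents it has noticed being referenced by multiple file extent
items, and a real implementation would also need to invalidate results
that may have become stale.

   #include <stdbool.h>

   #define PREV_EXTENTS_SIZE 8

   struct prev_extent_entry {
       unsigned long long bytenr; /* logical address of the data extent */
       bool is_shared;            /* cached result of the sharedness check */
       bool valid;                /* slot holds a cached entry */
   };

   struct prev_extents_cache {
       struct prev_extent_entry entries[PREV_EXTENTS_SIZE];
       int next_slot;             /* next slot to overwrite (round robin) */
   };

   /*
    * Linear scan of the 8 slots. The array is small on purpose, so a
    * miss costs almost nothing before falling back to the full check
    * of the extent tree and delayed refs.
    */
   static bool cache_lookup(const struct prev_extents_cache *cache,
                            unsigned long long bytenr, bool *is_shared)
   {
       for (int i = 0; i < PREV_EXTENTS_SIZE; i++) {
           if (cache->entries[i].valid &&
               cache->entries[i].bytenr == bytenr) {
               *is_shared = cache->entries[i].is_shared;
               return true;
           }
       }
       return false;
   }

   /* Store a result, overwriting the oldest slot in round robin fashion. */
   static void cache_store(struct prev_extents_cache *cache,
                           unsigned long long bytenr, bool is_shared)
   {
       struct prev_extent_entry *entry = &cache->entries[cache->next_slot];

       entry->bytenr = bytenr;
       entry->is_shared = is_shared;
       entry->valid = true;
       cache->next_slot = (cache->next_slot + 1) % PREV_EXTENTS_SIZE;
   }

A cache hit returns the result immediately and skips walking the extent
tree and delayed refs entirely, which is where the savings come from; a
miss only costs the scan of 8 slots before doing the full check.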
The array/cache is small so that doing the search has no noticeable
negative impact on the performance in case we don't have file extent
items within a distance of 8 slots that point to the same data extent.
Slots in the cache/array are overwritten in a simple round robin fashion,
as that approach fits very well.

Using this simple approach with only the last 8 data extents seen is
effective as usually, when multiple file extent items point to the same
data extent, their distance is within 8 slots. It also uses very little
memory, and the time to cache a result or look up the cache is
negligible.

The following test was run on a non-debug kernel (Debian's default kernel
config) to measure the impact in the case of COW writes (first example
given above), where we run fiemap after overwriting 33% of the blocks of
a file:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdi
   MNT=/mnt/sdi

   umount $DEV &> /dev/null
   mkfs.btrfs -f $DEV
   mount $DEV $MNT

   FILE_SIZE=$((1 * 1024 * 1024 * 1024))

   # Create the file full of 1M extents.
   xfs_io -f -s -c "pwrite -b 1M -S 0xab 0 $FILE_SIZE" $MNT/foobar

   block_count=$((FILE_SIZE / 4096))
   # Overwrite about 33% of the file blocks.
   overwrite_count=$((block_count / 3))

   echo -e "\nOverwriting $overwrite_count 4K blocks (out of $block_count)..."
   RANDOM=123
   for ((i = 1; i <= $overwrite_count; i++)); do
       off=$(((RANDOM % block_count) * 4096))
       xfs_io -c "pwrite -S 0xcd $off 4K" $MNT/foobar > /dev/null
       echo -ne "\r$i blocks overwritten..."
   done
   echo -e "\n"

   # Unmount and mount to clear all cached metadata.
   umount $MNT
   mount $DEV $MNT

   start=$(date +%s%N)
   filefrag $MNT/foobar
   end=$(date +%s%N)
   dur=$(( (end - start) / 1000000 ))
   echo "fiemap took $dur milliseconds"

   umount $MNT

Result before applying this patch:

   fiemap took 128 milliseconds

Result after applying this patch:

   fiemap took 92 milliseconds (-28.1%)

The test is somewhat limited in the sense that the gains may be higher in
practice, because in the test the filesystem is small, so we have small
fs and extent trees, and there's no concurrent access to the trees
either, therefore no lock contention there.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>