iv/linux

Filipe Manana b6e833567e btrfs: make hole and data seeking a lot more efficient

The current implementation of hole and data seeking for llseek does not scale well with the number of extents or with the distance between the start offset and the next hole or extent. This is due to its very high algorithmic complexity. We also often get reports of btrfs' hole and data seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link tag at the bottom).

To better understand it, let's consider the case where the start offset is 0, we are seeking for a hole and the file size is 16G. Between file offset 0 and the first hole in the file there are 100K extents - this is common for large files, especially if we have compression enabled, since the maximum extent size is limited to 128K. The steps taken by the main loop of the current algorithm are the following:

1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which calls btrfs_get_extent(). This will first look up an extent map in the inode's extent map tree (a red black tree). If the extent map is not loaded in memory, then it will do a lookup for the corresponding file extent item in the subvolume's b+tree, create an extent map based on the contents of the file extent item and then add the extent map to the extent map tree of the inode;

2) The second iteration calls btrfs_get_extent_fiemap() again, this time with a start offset matching the end offset of the previous extent. Again, btrfs_get_extent() will first search the extent map tree, and if it doesn't find an extent map there, it will again search in the b+tree of the subvolume for a matching file extent item, build an extent map based on the file extent item, and add the extent map to the extent map tree of the inode;

3) This repeats over and over until we find the first hole (when seeking for holes) or until we find the first extent (when seeking for data).
If there are no extent maps loaded in memory for each iteration, then on each iteration we do 1 extent map tree search, 1 b+tree search, plus 1 more extent map tree traversal to insert an extent map - plus we allocate memory for the extent map.

On each iteration we are growing the size of the extent map tree, making each future search slower, and also visiting the same b+tree leaves over and over again - taking into account that with the default leaf size of 16K we can fit more than 200 file extent items in a leaf - so we can visit the same b+tree leaf 200+ times, on each visit walking down a path from the root to the leaf.

So it's easy to see that what we have now doesn't scale well. Also, it loads an extent map for every file extent item into memory, which is not efficient - we should add extent maps only when doing IO (writing or reading file data).

This change implements a new algorithm which scales much better, and works like this:

1) We iterate over the subvolume's b+tree, visiting each leaf that has file extent items once and only once;

2) For any file extent items found, that don't represent holes or prealloc extents, it will not search the extent map tree - there's no need at all for that - an extent map is just an in-memory representation of a file extent item;

3) When a hole is found, or a prealloc extent, it will check if there's delalloc for its range. For this it will search for EXTENT_DELALLOC bits in the inode's io tree and check the extent map tree - this is for accounting for unflushed delalloc and for flushed delalloc (the period between running delalloc and ordered extent completion), respectively. This is similar to what the current implementation does when it finds a hole or prealloc extent, but without creating extent maps and adding them to the extent map tree in case they are not loaded in memory;

4) It never allocates extent maps, or adds extent maps to the inode's extent map tree.
This not only saves memory and time (from the tree insertions and allocations), but also eliminates the possibility of -ENOMEM due to allocating too many extent maps.

Part of this new code will also be used later for fiemap (which also suffers similar scalability problems).

The following test example can be used to quickly measure the efficiency before and after this patch:

    $ cat test-seek-hole.sh
    #!/bin/bash

    DEV=/dev/sdi
    MNT=/mnt/sdi

    mkfs.btrfs -f $DEV

    mount -o compress=lzo $DEV $MNT

    # 16G file -> 131073 compressed extents.
    xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar

    # Leave a 1M hole at file offset 15G.
    xfs_io -c "fpunch 15G 1M" $MNT/foobar

    # Unmount and mount again, so that we can test when there's no
    # metadata cached in memory.
    umount $MNT
    mount -o compress=lzo $DEV $MNT

    # Test seeking for hole from offset 0 (hole is at offset 15G).

    start=$(date +%s%N)
    xfs_io -c "seek -h 0" $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "Took $dur milliseconds to seek first hole (metadata not cached)"
    echo

    start=$(date +%s%N)
    xfs_io -c "seek -h 0" $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "Took $dur milliseconds to seek first hole (metadata cached)"
    echo

    umount $MNT

Before this change:

    $ ./test-seek-hole.sh
    (...)
    Whence	Result
    HOLE	16106127360
    Took 176 milliseconds to seek first hole (metadata not cached)

    Whence	Result
    HOLE	16106127360
    Took 17 milliseconds to seek first hole (metadata cached)

After this change:

    $ ./test-seek-hole.sh
    (...)
    Whence	Result
    HOLE	16106127360
    Took 43 milliseconds to seek first hole (metadata not cached)

    Whence	Result
    HOLE	16106127360
    Took 13 milliseconds to seek first hole (metadata cached)

That's about 4x faster when no metadata is cached and about 30% faster when all metadata is cached.
In practice the differences may often be significantly higher, either due to a higher number of extents in a file or because the subvolume's b+tree is much bigger than in this example, where we only have one file.

Link: https://lwn.net/Articles/718805/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2022-09-26 12:28:00 +02:00
..
tests	btrfs: remove use btrfs_remove_free_space_cache instead of variant	2022-09-26 12:27:58 +02:00
acl.c	btrfs: reserve correct number of items for inode creation	2022-05-16 17:03:08 +02:00
async-thread.c	btrfs: simplify WQ_HIGHPRI handling in struct btrfs_workqueue	2022-05-16 17:03:15 +02:00
async-thread.h	btrfs: remove unused typedefs get_extent_t and btrfs_work_func_t	2022-07-25 17:45:36 +02:00
backref.c	btrfs: sink iterator parameter to btrfs_ioctl_logical_to_ino	2022-07-25 17:45:36 +02:00
backref.h	btrfs: sink iterator parameter to btrfs_ioctl_logical_to_ino	2022-07-25 17:45:36 +02:00
block-group.c	btrfs: enhance unsupported compat RO flags handling	2022-09-26 12:28:00 +02:00
block-group.h	btrfs: get rid of block group caching progress logic	2022-09-26 12:27:58 +02:00
block-rsv.c	btrfs: don't save block group root into super block	2022-09-26 12:28:00 +02:00
block-rsv.h	btrfs: use enum for btrfs_block_rsv::type	2022-07-25 17:45:40 +02:00
btrfs_inode.h	btrfs: add optimized btrfs_ino() version for 64 bits systems	2022-07-25 17:45:41 +02:00
check-integrity.c	fs/btrfs: Use the enum req_op and blk_opf_t types	2022-07-14 12:14:32 -06:00
check-integrity.h	btrfs: check-integrity: split submit_bio from btrfsic checking	2022-05-16 17:03:12 +02:00
compression.c	btrfs: give struct btrfs_bio a real end_io handler	2022-09-26 12:27:59 +02:00
compression.h	for-5.20-tag	2022-08-03 14:54:52 -07:00
ctree.c	btrfs: fix lockdep splat with reloc root extent buffers	2022-08-17 16:19:12 +02:00
ctree.h	btrfs: separate BLOCK_GROUP_TREE compat RO flag from EXTENT_TREE_V2	2022-09-26 12:28:00 +02:00
delalloc-space.c	btrfs: convert count_max_extents() to use fs_info->max_extent_size	2022-07-25 17:45:41 +02:00
delalloc-space.h
delayed-inode.c	btrfs: use delayed items when logging a directory	2022-09-26 12:27:57 +02:00
delayed-inode.h	btrfs: use delayed items when logging a directory	2022-09-26 12:27:57 +02:00
delayed-ref.c	btrfs: switch btrfs_block_rsv::full to bool	2022-07-25 17:45:40 +02:00
delayed-ref.h	btrfs: remove btrfs_delayed_extent_op::is_data	2022-05-16 17:17:31 +02:00
dev-replace.c	btrfs: don't take a bio_counter reference for cloned bios	2022-09-26 12:27:58 +02:00
dev-replace.h
dir-item.c	btrfs: use btrfs_for_each_slot in btrfs_search_dir_index_item	2022-05-16 17:03:07 +02:00
discard.c	btrfs: fix typos in comments	2021-06-22 14:11:57 +02:00
discard.h
disk-io.c	btrfs: separate BLOCK_GROUP_TREE compat RO flag from EXTENT_TREE_V2	2022-09-26 12:28:00 +02:00
disk-io.h	btrfs: separate BLOCK_GROUP_TREE compat RO flag from EXTENT_TREE_V2	2022-09-26 12:28:00 +02:00
export.c
export.h
extent_io.c	btrfs: give struct btrfs_bio a real end_io handler	2022-09-26 12:27:59 +02:00
extent_io.h	btrfs: move btrfs_bio allocation to volumes.c	2022-09-26 12:27:58 +02:00
extent_map.c	btrfs: assert we have a write lock when removing and replacing extent maps	2022-03-14 13:13:50 +01:00
extent_map.h	btrfs: defrag: don't use merged extent map for their generation check	2022-02-23 17:43:13 +01:00
extent-io-tree.h	btrfs: move btrfs_bio allocation to volumes.c	2022-09-26 12:27:58 +02:00
extent-tree.c	btrfs: get rid of block group caching progress logic	2022-09-26 12:27:58 +02:00
file-item.c	btrfs: rename btrfs_insert_file_extent() to btrfs_insert_hole_extent()	2022-09-26 12:27:54 +02:00
file.c	btrfs: make hole and data seeking a lot more efficient	2022-09-26 12:28:00 +02:00
free-space-cache.c	btrfs: remove use btrfs_remove_free_space_cache instead of variant	2022-09-26 12:27:58 +02:00
free-space-cache.h	btrfs: remove use btrfs_remove_free_space_cache instead of variant	2022-09-26 12:27:58 +02:00
free-space-tree.c	btrfs: get rid of block group caching progress logic	2022-09-26 12:27:58 +02:00
free-space-tree.h
inode-item.c	btrfs: make should_throttle loop local in btrfs_truncate_inode_items	2022-01-07 14:18:25 +01:00
inode-item.h	btrfs: add inode to truncate control	2022-01-07 14:18:24 +01:00
inode.c	btrfs: give struct btrfs_bio a real end_io handler	2022-09-26 12:27:59 +02:00
ioctl.c	btrfs: use fs_info->max_extent_size in get_extent_max_capacity()	2022-07-25 17:45:41 +02:00
Kconfig	btrfs: use generic Kconfig option for 256kB page size limit	2022-01-20 08:52:55 +02:00
locking.c	btrfs: fix lockdep splat with reloc root extent buffers	2022-08-17 16:19:12 +02:00
locking.h	btrfs: fix lockdep splat with reloc root extent buffers	2022-08-17 16:19:12 +02:00
lzo.c	btrfs: replace kmap() with kmap_local_page() in lzo.c	2022-07-25 17:45:33 +02:00
Makefile	Kbuild: add -Wno-shift-negative-value where -Wextra is used	2022-03-13 17:30:31 +09:00
misc.h	btrfs: use correct header for div_u64 in misc.h	2021-09-07 14:29:50 +02:00
ordered-data.c	btrfs: add lockdep annotations for the ordered extents wait event	2022-09-26 12:27:53 +02:00
ordered-data.h	btrfs: remove the finish_func argument to btrfs_mark_ordered_io_finished	2022-07-25 17:45:37 +02:00
orphan.c
print-tree.c	btrfs: unify the error handling pattern for read_tree_block()	2022-03-14 13:13:53 +01:00
print-tree.h
props.c	btrfs: remove the unnecessary result variables	2022-09-26 12:28:00 +02:00
props.h	btrfs: move common inode creation code into btrfs_create_new_inode()	2022-05-16 17:03:08 +02:00
qgroup.c	btrfs: fix race between quota enable and quota rescan ioctl	2022-09-26 12:27:58 +02:00
qgroup.h	btrfs: avoid blocking on space revervation when doing nowait dio writes	2022-05-16 17:03:10 +02:00
raid56.c	btrfs: properly abstract the parity raid bio handling	2022-09-26 12:27:59 +02:00
raid56.h	btrfs: properly abstract the parity raid bio handling	2022-09-26 12:27:59 +02:00
rcu-string.h
ref-verify.c	btrfs: stop accessing ->extent_root directly	2022-01-03 15:09:49 +01:00
ref-verify.h
reflink.c	btrfs: clean up chained assignments	2022-07-25 17:45:39 +02:00
reflink.h
relocation.c	btrfs: fix lockdep splat with reloc root extent buffers	2022-08-17 16:19:12 +02:00
root-tree.c	btrfs: simplify error handling at btrfs_del_root_ref()	2022-09-26 12:27:58 +02:00
scrub.c	btrfs: properly abstract the parity raid bio handling	2022-09-26 12:27:59 +02:00
send.c	btrfs: send: fix failures when processing inodes with no links	2022-09-26 12:27:57 +02:00
send.h	btrfs: send: add support for fs-verity	2022-09-26 12:27:55 +02:00
space-info.c	btrfs: dump all space infos if we abort transaction due to ENOSPC	2022-09-26 12:27:59 +02:00
space-info.h	btrfs: dump all space infos if we abort transaction due to ENOSPC	2022-09-26 12:27:59 +02:00
struct-funcs.c	btrfs: remove redundant check in up check_setget_bounds	2022-07-25 17:45:33 +02:00
subpage.c	btrfs: remove extent writepage address space operation	2022-07-25 17:45:37 +02:00
subpage.h	btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine	2022-05-16 17:03:11 +02:00
super.c	btrfs: enhance unsupported compat RO flags handling	2022-09-26 12:28:00 +02:00
sysfs.c	btrfs: remove the unnecessary result variables	2022-09-26 12:28:00 +02:00
sysfs.h
transaction.c	btrfs: don't save block group root into super block	2022-09-26 12:28:00 +02:00
transaction.h	btrfs: pass btrfs_fs_info for deleting snapshots and cleaner	2022-03-14 13:13:52 +01:00
tree-checker.c	btrfs: tree-checker: check for overlapping extent items	2022-08-17 16:20:25 +02:00
tree-checker.h	btrfs: tree-checker: check extent buffer owner against owner rootid	2022-05-16 17:03:09 +02:00
tree-defrag.c	btrfs: remove unnecessary extent root check in btrfs_defrag_leaves	2022-01-03 15:09:48 +01:00
tree-log.c	btrfs: simplify adding and replacing references during log replay	2022-09-26 12:27:57 +02:00
tree-log.h	btrfs: use delayed items when logging a directory	2022-09-26 12:27:57 +02:00
tree-mod-log.c	btrfs: fix race when picking most recent mod log operation for an old root	2021-04-20 19:27:17 +02:00
tree-mod-log.h	btrfs: add and use helper to get lowest sequence number for the tree mod log	2021-04-19 17:25:17 +02:00
ulist.c
ulist.h
uuid-tree.c	btrfs: drop the _nr from the item helpers	2022-01-03 15:09:43 +01:00
verity.c	btrfs: send: add support for fs-verity	2022-09-26 12:27:55 +02:00
volumes.c	btrfs: check superblock to ensure the fs was not modified at thaw time	2022-09-26 12:27:59 +02:00
volumes.h	btrfs: give struct btrfs_bio a real end_io handler	2022-09-26 12:27:59 +02:00
xattr.c	btrfs: check if root is readonly while setting security xattr	2022-08-22 18:06:30 +02:00
xattr.h
zlib.c	btrfs: zlib: replace kmap() with kmap_local_page() in zlib_decompress_bio()	2022-07-25 17:45:41 +02:00
zoned.c	btrfs: get rid of block group caching progress logic	2022-09-26 12:27:58 +02:00
zoned.h	btrfs: zoned: activate metadata block group on flush_space	2022-07-25 17:45:42 +02:00
zstd.c	btrfs: zstd: replace kmap() with kmap_local_page()	2022-07-25 17:45:40 +02:00