linux

iv/linux

Author	SHA1	Message	Date
David Sterba	882681ac98	btrfs: pass btrfs_inode to btrfs_submit_bio_start The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:52 +01:00
David Sterba	da67daab8d	btrfs: switch async_submit_bio::inode to btrfs_inode The async bio submit is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:51 +01:00
David Sterba	ad65ecf30b	btrfs: simplify btree_submit_bio_start and btrfs_submit_bio_start parameters After previous patches the unused parameters can be removed from btree_submit_bio_start and btrfs_submit_bio_start as they don't need to conform to the extent_submit_bio_start_t typedef. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:51 +01:00
David Sterba	ab2072b292	btrfs: change how submit bio callback is passed to btrfs_wq_submit_bio There's a callback function parameter for btrfs_wq_submit_bio that can be one of: metadata, buffered data, direct io data. The callback abstraction is unnecessary as we have all functions available. Replace the parameter with a command that leads to a direct call in run_one_async_start. The called functions can be then simplified and we can also remove the extent_submit_bio_start_t typedef. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:51 +01:00
David Sterba	7920b773bd	btrfs: drop parameter compression_type from btrfs_submit_dio_repair_bio Compression and direct io don't work together so the compression parameter can be dropped after previous patch that changed the call to direct. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:51 +01:00
David Sterba	19af6a7d34	btrfs: change how repair action is passed to btrfs_repair_one_sector There's a function pointer passed to btrfs_repair_one_sector that will submit the right bio for repair. However there are only two callbacks, for buffered and for direct IO. This can be simplified to a bool-based switch and call either function, indirect calls in this case is an unnecessary abstraction. This allows to remove the submit_bio_hook_t typedef. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:51 +01:00
David Sterba	961f5b8bf4	btrfs: convert btrfs_block_group::seq_zone to runtime flag In zoned mode the sequential status of zone can be also tracked in the runtime flags of block group. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:51 +01:00
David Sterba	0d7764ff58	btrfs: convert btrfs_block_group::needs_free_space to runtime flag We already have flags in block group to track various status bits, convert needs_free_space as well and reduce size of btrfs_block_group. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:51 +01:00
David Sterba	fd463ac461	btrfs: zoned: use helper to check a power of two zone size We have a 64bit compatible helper to check if a value is a power of two, use it instead of open coding it. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:51 +01:00
David Sterba	9e5e6d4e2e	btrfs: zlib: use copy_page for full page copy The copy_page helper may use an optimized version for full page copy (eg. on s390 there's a special instruction for that), there's one more left to convert. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:51 +01:00
Filipe Manana	e2a0416577	btrfs: send: bump the extent reference count limit for backref walking After the previous patchset which is comprised of the following patches: 01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs() 02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes() 03/17 btrfs: fix ulist leaks in error paths of qgroup self tests 04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests 05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone 06/17 btrfs: send: update comment at find_extent_clone() 07/17 btrfs: send: drop unnecessary backref context field initializations 08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source 09/17 btrfs: send: optimize clone detection to increase extent sharing 10/17 btrfs: use a single argument for extent offset in backref walking functions 11/17 btrfs: use a structure to pass arguments to backref walking functions 12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() 13/17 btrfs: constify ulist parameter of ulist_next() 14/17 btrfs: send: cache leaf to roots mapping during backref walking 15/17 btrfs: send: skip unnecessary backref iterations 16/17 btrfs: send: avoid double extent tree search when finding clone source 17/17 btrfs: send: skip resolution of our own backref when finding clone source we have now much better performance when doing backref walking in the send code, so we can increase the current limit from 64 to 1024 references. This limit is still a bit conservative because there are still edge cases where backref walking will be too slow and spend a lot of cpu time, some IO reading b+tree nodes/leaves and memory. The goal is to eventually get rid of any limit, but for now bump it as it benefits users with extents shared more than 64 times and up to 1024 times, allowing for more deduplication at the destination without having to run a dedupe tool after a receive. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:51 +01:00
Filipe Manana	adf0241868	btrfs: send: skip resolution of our own backref when finding clone source When doing backref walking to determine a source range to clone from, it is worthless to collect and resolve our own data backref, as we can't obviously use it as a clone source and it represents the range we want to clone into. Collecting the backref implies doing the extra work to resolve it, doing the search for a file extent item in a subvolume tree, etc. Skipping the data backref is valid as long as we only have the send root as the single clone root, otherwise the leaf with the file extent item may be accessible from another clone root due to shared subtrees created by snapshots, and therefore we have to collect the backref and resolve it. So add a callback to the backref walking code to guide it to skip data backrefs. This change is part of a patchset comprised of the following patches: 01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs() 02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes() 03/17 btrfs: fix ulist leaks in error paths of qgroup self tests 04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests 05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone 06/17 btrfs: send: update comment at find_extent_clone() 07/17 btrfs: send: drop unnecessary backref context field initializations 08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source 09/17 btrfs: send: optimize clone detection to increase extent sharing 10/17 btrfs: use a single argument for extent offset in backref walking functions 11/17 btrfs: use a structure to pass arguments to backref walking functions 12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() 13/17 btrfs: constify ulist parameter of ulist_next() 14/17 btrfs: send: cache leaf to roots mapping during backref walking 15/17 btrfs: send: skip unnecessary backref iterations 16/17 btrfs: send: avoid double extent tree search when finding clone source 17/17 btrfs: send: skip resolution of our own backref when finding clone source The following test was run on non-debug kernel (Debian's default kernel config) before and after applying the patchset: $ cat test-send-many-shared-extents.sh #!/bin/bash DEV=/dev/sdh MNT=/mnt/sdh umount $DEV &> /dev/null mkfs.btrfs -f $DEV mount $DEV $MNT num_files=50000 num_clones_per_file=50 for ((i = 1; i <= $num_files; i++)); do xfs_io -f -c "pwrite 0 64K" $MNT/file_$i > /dev/null echo -ne "\r$i files created..." done echo btrfs subvolume snapshot -r $MNT $MNT/snap1 cloned=0 for ((i = 1; i <= $num_clones_per_file; i++)); do for ((j = 1; j <= $num_files; j++)); do cp --reflink=always $MNT/file_$j $MNT/file_${j}_clone_${i} cloned=$((cloned + 1)) echo -ne "\r$cloned / $((num_files * num_clones_per_file)) clone operations" done done echo btrfs subvolume snapshot -r $MNT $MNT/snap2 # Unmount and mount again to clear all cached metadata (and data). umount $DEV mount $DEV $MNT start=$(date +%s%N) btrfs send $MNT/snap2 > /dev/null end=$(date +%s%N) dur=$(( (end - start) / 1000000000 )) echo -e "\nFull send took $dur seconds" # Unmount and mount again to clear all cached metadata (and data). umount $DEV mount $DEV $MNT start=$(date +%s%N) btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null end=$(date +%s%N) dur=$(( (end - start) / 1000000000 )) echo -e "\nIncremental send took $dur seconds" umount $MNT Before applying the patchset: (...) Full send took 1108 seconds (...) Incremental send took 1135 seconds After applying the whole patchset: (...) Full send took 268 seconds (-75.8%) (...) Incremental send took 316 seconds (-72.2%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:50 +01:00
Filipe Manana	f73853c716	btrfs: send: avoid double extent tree search when finding clone source At find_extent_clone() we search twice for the extent item corresponding to the data extent that the current file extent items points to: 1) Once with a call to extent_from_logical(); 2) Once again during backref walking, through iterate_extent_inodes() which eventually leads to find_parent_nodes() where we will search again the extent tree for the same extent item. The extent tree can be huge, so doing this one extra search for every extent we want to send adds up and it's expensive. The first call is there since the send code was introduced and it accomplishes two things: 1) Check that the extent is flagged as a data extent in the extent tree. But it can not be anything else, otherwise we wouldn't have a file extent item in the send root pointing to it. This was probably added to catch bugs in the early days where send was yet too young and the interaction with everything else was far from perfect; 2) Check how many direct references there are on the extent, and if there's too many (more than SEND_MAX_EXTENT_REFS), avoid doing the backred walking as it may take too long and slowdown send. So improve on this by having a callback in the backref walking code that is called when it finds the extent item in the extent tree, and have those checks done in the callback. When the callback returns anything different from 0, it stops the backref walking code. This way we do a single search on the extent tree for the extent item of our data extent. Also, before this change we were only checking the number of references on the data extent against SEND_MAX_EXTENT_REFS, but after starting backref walking we will end up resolving backrefs for extent buffers in the path from a leaf having a file extent item pointing to our data extent, up to roots of trees from which the extent buffer is accessible from, due to shared subtrees resulting from snapshoting. We were therefore allowing for the possibility for send taking too long due to some node in the path from the leaf to a root node being shared too many times. After this change we check for reference counts being greater than SEND_MAX_EXTENT_REFS for both data extents and metadata extents. This change is part of a patchset comprised of the following patches: 01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs() 02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes() 03/17 btrfs: fix ulist leaks in error paths of qgroup self tests 04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests 05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone 06/17 btrfs: send: update comment at find_extent_clone() 07/17 btrfs: send: drop unnecessary backref context field initializations 08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source 09/17 btrfs: send: optimize clone detection to increase extent sharing 10/17 btrfs: use a single argument for extent offset in backref walking functions 11/17 btrfs: use a structure to pass arguments to backref walking functions 12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() 13/17 btrfs: constify ulist parameter of ulist_next() 14/17 btrfs: send: cache leaf to roots mapping during backref walking 15/17 btrfs: send: skip unnecessary backref iterations 16/17 btrfs: send: avoid double extent tree search when finding clone source 17/17 btrfs: send: skip resolution of our own backref when finding clone source Performance test results are in the changelog of patch 17/17. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:50 +01:00
Filipe Manana	88ffb665c8	btrfs: send: skip unnecessary backref iterations When looking for a clone source for an extent, we are iterating over all the backreferences for an extent. This is often a waste of time, because once we find a good clone source we could stop immediately instead of continuing backref walking, which is expensive. Basically what happens currently is this: 1) Call iterate_extent_inodes() to iterate over all the backreferences; 2) It calls btrfs_find_all_leafs() which in turn calls the main function to walk over backrefs and collect them - find_parent_nodes(); 3) Then we collect all the references for our target data extent from the extent tree (and delayed refs if any), add them to the rb trees, resolve all the indirect backreferences and search for all the file extent items in fs trees, building a list of inodes for each one of them (struct extent_inode_elem); 4) Then back at iterate_extent_inodes() we find all the roots associated to each found leaf, and call the callback __iterate_backrefs defined at send.c for each inode in the inode list associated to each leaf. Some times one the first backreferences we find in a fs tree is optimal to satisfy the clone operation that send wants to perform, and in that case we could stop immediately and avoid resolving all the remaining indirect backreferences (search fs trees for the respective file extent items, etc). This possibly if when we find a fs tree leaf with a file extent item we are able to know what are all the roots that can lead to the leaf - this is now possible after the previous patch in the series that adds a cache that maps leaves to a list of roots. So we can now shortcircuit backref walking during send, by having the callback we pass to iterate_extent_inodes() to be called when we find a file extent item for an indirect backreference, and have it return a special value when it found a suitable backreference and it does not need to look for more backreferences. This change does that. This change is part of a patchset comprised of the following patches: 01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs() 02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes() 03/17 btrfs: fix ulist leaks in error paths of qgroup self tests 04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests 05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone 06/17 btrfs: send: update comment at find_extent_clone() 07/17 btrfs: send: drop unnecessary backref context field initializations 08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source 09/17 btrfs: send: optimize clone detection to increase extent sharing 10/17 btrfs: use a single argument for extent offset in backref walking functions 11/17 btrfs: use a structure to pass arguments to backref walking functions 12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() 13/17 btrfs: constify ulist parameter of ulist_next() 14/17 btrfs: send: cache leaf to roots mapping during backref walking 15/17 btrfs: send: skip unnecessary backref iterations 16/17 btrfs: send: avoid double extent tree search when finding clone source 17/17 btrfs: send: skip resolution of our own backref when finding clone source Performance test results are in the changelog of patch 17/17. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:50 +01:00
Filipe Manana	66d04209e5	btrfs: send: cache leaf to roots mapping during backref walking During a send operation, when doing backref walking to determine which inodes/offsets/roots we can clone from, the most repetitive and expensive step is to map each leaf that has file extent items pointing to the target data extent to the IDs of the roots from which the leaves are accessible, which happens at iterate_extent_inodes(). That step requires finding every parent node of a leaf, then the parent of each parent, and so on until we reach a root node. So it's a naturally expensive operation, and repetitive because each leaf can have hundreds of file extent items (for a nodesize of 16K, that can be slightly over 200 file extent items). There's also temporal locality, as we process all file extent items from a leave before moving the next leaf. This change caches the mapping of leaves to root IDs, to avoid repeating those computations over and over again. The cache is limited to a maximum of 128 entries, with each entry being a struct with a size of 128 bytes, so the maximum cache size is 16K plus any nodes internally allocated by the maple tree that is used to index pointers to those structs. The cache is invalidated whenever we detect relocation happened since we started filling the cache, because if relocation happened then extent buffers for leaves and nodes of the trees used by a send operation may have been reallocated. This cache also allows for another important optimization that is introduced in the next patch in the series. This change is part of a patchset comprised of the following patches: 01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs() 02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes() 03/17 btrfs: fix ulist leaks in error paths of qgroup self tests 04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests 05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone 06/17 btrfs: send: update comment at find_extent_clone() 07/17 btrfs: send: drop unnecessary backref context field initializations 08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source 09/17 btrfs: send: optimize clone detection to increase extent sharing 10/17 btrfs: use a single argument for extent offset in backref walking functions 11/17 btrfs: use a structure to pass arguments to backref walking functions 12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() 13/17 btrfs: constify ulist parameter of ulist_next() 14/17 btrfs: send: cache leaf to roots mapping during backref walking 15/17 btrfs: send: skip unnecessary backref iterations 16/17 btrfs: send: avoid double extent tree search when finding clone source 17/17 btrfs: send: skip resolution of our own backref when finding clone source Performance test results are in the changelog of patch 17/17. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:50 +01:00
Filipe Manana	fa104a8790	btrfs: constify ulist parameter of ulist_next() The ulist_next() iterator function does not need to change the given ulist so make it const. This will allow the next patch in the series to pass a ulist to a function that does not need, and should not, modify the ulist. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:50 +01:00
Filipe Manana	1baea6f18a	btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() At iterate_extent_inodes() we collect a ulist of leaves for a given extent with a call to btrfs_find_all_leafs() and then we enter a loop where we iterate over all the collected leaves. Each iteration of that loop does a call to btrfs_find_all_roots_safe(), to determine all roots from which a leaf is accessible, and that results in allocating and releasing a ulist to store the root IDs. Instead of allocating and releasing the roots ulist on every iteration, allocate a ulist before entering the loop and keep using it on each iteration, reinitializing the ulist at the end of each iteration. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:50 +01:00
Filipe Manana	a2c8d27e5e	btrfs: use a structure to pass arguments to backref walking functions The public backref walking functions have quite a lot of arguments that are passed down the call stack to find_parent_nodes(), the core function of the backref walking code. The next patches in series will need to add even arguments to these functions that should be passed not only to find_parent_nodes(), but also to other functions used by the later (directly or even lower in the call stack). So create a structure to hold all these arguments and state used by the main backref walking function, find_parent_nodes(), and use it as the argument for the public backref walking functions iterate_extent_inodes(), btrfs_find_all_leafs() and btrfs_find_all_roots(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:50 +01:00
Filipe Manana	6ce6ba5344	btrfs: use a single argument for extent offset in backref walking functions The interface for find_parent_nodes() has two extent offset related arguments: 1) One u64 pointer argument for the extent offset; 2) One boolean argument to tell if the extent offset should be ignored or not. These are confusing, becase the extent offset pointer can be NULL and in some cases callers pass a NULL value as a way to tell the backref walking code to ignore offsets in file extent items (and simply consider all file extent items that point to the target data extent). The boolean argument was added in commit `c995ab3cda` ("btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents"), but it was never really necessary, it was enough if it could find a way to get a NULL value passed to the "extent_item_pos" argument of find_parent_nodes(). The arguments are also passed to functions called by find_parent_nodes() and respective helper functions, which further makes everything more complicated than needed. Then we have several backref walking related functions that end up calling find_parent_nodes(), either directly or through some other function that they call, and for many we have to use an "extent_item_pos" (u64) argument and a boolean "ignore_offset" argument too. This is confusing and not really necessary. So use a single argument to specify the extent offset, as a simple u64 and not as a pointer, but using a special value of (u64)-1, defined as a documented constant, to indicate when the extent offset should be ignored. This is also preparation work for the upcoming patches in the series that add other arguments to find_parent_nodes() and other related functions that use it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:50 +01:00
Filipe Manana	c7499a64dc	btrfs: send: optimize clone detection to increase extent sharing Currently send does not do the best decisions when it comes to decide between multiple clone sources, which results in clone operations for partial extent ranges, which has the following disadvantages: 1) We get less shared extents at the destination; 2) We have to read more data during the send operation and emit more write commands. Besides not being optimal behaviour, it also breaks user expectations and is often reported by users, with a recent example in the Link tag at the bottom of this change log. Part of the reason for this non-optimal behaviour is that the backref walking code does not provide information about the length of the file extent items that were found for each backref, so send is blind about which backref is the best to chose as a cloning source. The other existing reasons are just silliness, namely always prefering the inode with the lowest number when multiple are found for the same root and when we can clone from multiple roots, always prefer the send root over any of the other clone roots. This does not make any sense since any inode or root is fine and as good as any other inode/root. Fix this by making backref walking pass information about the number of bytes referenced by each file extent item and then have send's backref callback pick the inode with the highest number of bytes for each root. Finally select the root from which we can clone more bytes from. Example reproducer: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount $DEV $MNT xfs_io -f -c "pwrite -S 0xab -b 2M 0 2M" $MNT/foo cp --reflink=always $MNT/foo $MNT/bar cp --reflink=always $MNT/foo $MNT/baz sync # Overwrite the second half of file foo. xfs_io -c "pwrite -S 0xcd -b 1M 1M 1M" $MNT/foo sync echo echo "* fiemap in the original filesystem " echo xfs_io -c "fiemap -v" $MNT/foo xfs_io -c "fiemap -v" $MNT/bar xfs_io -c "fiemap -v" $MNT/baz echo btrfs filesystem du $MNT btrfs subvolume snapshot -r $MNT $MNT/snap btrfs send -f /tmp/send_stream $MNT/snap umount $MNT mkfs.btrfs -f $DEV &> /dev/null mount $DEV $MNT btrfs receive -f /tmp/send_stream $MNT echo echo " fiemap in the new filesystem " echo xfs_io -r -c "fiemap -v" $MNT/snap/foo xfs_io -r -c "fiemap -v" $MNT/snap/bar xfs_io -r -c "fiemap -v" $MNT/snap/baz echo btrfs filesystem du $MNT rm -f /tmp/send_stream rm -f /tmp/snap.fssum umount $MNT Before this change: $ ./test.sh (...) fiemap in the original filesystem * /mnt/sdi/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x1 /mnt/sdi/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 /mnt/sdi/baz: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 Total Exclusive Set shared Filename 2.00MiB 1.00MiB - /mnt/sdi/foo 2.00MiB 0.00B - /mnt/sdi/bar 2.00MiB 0.00B - /mnt/sdi/baz 6.00MiB 1.00MiB 2.00MiB /mnt/sdi Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap' At subvol /mnt/sdi/snap At subvol snap * fiemap in the new filesystem * /mnt/sdi/snap/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 /mnt/sdi/snap/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x1 /mnt/sdi/snap/baz: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 32768..34815 2048 0x1 Total Exclusive Set shared Filename 2.00MiB 0.00B - /mnt/sdi/snap/foo 2.00MiB 1.00MiB - /mnt/sdi/snap/bar 2.00MiB 1.00MiB - /mnt/sdi/snap/baz 6.00MiB 2.00MiB - /mnt/sdi/snap 6.00MiB 2.00MiB 2.00MiB /mnt/sdi We end up with two 1M extents that are not shared for files bar and baz. After this change: $ ./test.sh (...) * fiemap in the original filesystem * /mnt/sdi/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x1 /mnt/sdi/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 /mnt/sdi/baz: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 Total Exclusive Set shared Filename 2.00MiB 1.00MiB - /mnt/sdi/foo 2.00MiB 0.00B - /mnt/sdi/bar 2.00MiB 0.00B - /mnt/sdi/baz 6.00MiB 1.00MiB 2.00MiB /mnt/sdi Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap' At subvol /mnt/sdi/snap At subvol snap * fiemap in the new filesystem * /mnt/sdi/snap/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 /mnt/sdi/snap/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x2001 /mnt/sdi/snap/baz: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x2001 Total Exclusive Set shared Filename 2.00MiB 0.00B - /mnt/sdi/snap/foo 2.00MiB 0.00B - /mnt/sdi/snap/bar 2.00MiB 0.00B - /mnt/sdi/snap/baz 6.00MiB 0.00B - /mnt/sdi/snap 6.00MiB 0.00B 3.00MiB /mnt/sdi Now there's a much better sharing, files bar and baz share 1M of the extent of file foo and the second extent of files bar and baz is shared between themselves. This will later be turned into a test case for fstests. Link: https://lore.kernel.org/linux-btrfs/20221008005704.795b44b0@crass-HP-ZBook-15-G2/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:50 +01:00
Filipe Manana	22a3c0ac8e	btrfs: send: avoid unnecessary backref lookups when finding clone source At find_extent_clone(), unless we are given an inline extent, a file extent item that represents hole or an extent that starts beyond the i_size, we always do backref walking to look for clone sources, unless if we have more than SEND_MAX_EXTENT_REFS (64) known references on the extent. However if we know we only have one reference in the extent item and only one clone source (the send root), then it's pointless to do the backref walking to search for clone sources, as we can't clone from any other root. So skip the backref walking in that case. The following test was run on a non-debug kernel (Debian's default kernel config): $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount $DEV $MNT # Create an extent tree that's not too small and none of the # extents is shared. for ((i = 1; i <= 50000; i++)); do xfs_io -f -c "pwrite 0 4K" $MNT/file_$i > /dev/null echo -ne "\r$i files created..." done echo btrfs subvolume snapshot -r $MNT $MNT/snap start=$(date +%s%N) btrfs send $MNT/snap > /dev/null end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo -e "\nsend took $dur milliseconds" umount $MNT Before this change: send took 5389 milliseconds After this change: send took 4519 milliseconds (-16.1%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:50 +01:00
Filipe Manana	344174a1a6	btrfs: send: drop unnecessary backref context field initializations At find_extent_clone() we are initializing to zero the 'found_itself' and 'found' fields of the backref context before we use it but we have already initialized the structure to zeroes when we declared it on stack, so it's pointless to initialize those fields and they are unnecessarily increasing the object text size with two "mov" instructions (x86_64). Similarly make the 'extent_len' initialization more clear by using an if- -then-else instead of a double assignment to it in case the extent's end crosses the i_size boundary. Before this change: $ size fs/btrfs/send.o text data bss dec hex filename 68694 4252 16 72962 11d02 fs/btrfs/send.o After this change: $ size fs/btrfs/send.o text data bss dec hex filename 68678 4252 16 72946 11cf2 fs/btrfs/send.o Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:49 +01:00
Filipe Manana	d3f41317f0	btrfs: send: update comment at find_extent_clone() We have this unclear comment at find_extent_clone() about extents starting at a file offset greater than or equals to the i_size of the inode. It's not really informative and it's misleading, since it mentions the author found such extents with snapshots and large files. Such extents are a result of fallocate with FALLOC_FL_KEEP_SIZE and there is no relation to snapshots or large files (all write paths update the i_size before inserting a new file extent item). So update the comment to be precise about it and why we don't bother looking for clone sources in that case. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:49 +01:00
Filipe Manana	61ce908a3c	btrfs: send: avoid unnecessary path allocations when finding extent clone When looking for an extent clone, at find_extent_clone(), we start by allocating a path and then check for cases where we can't have clones and exit immediately in those cases. It's a waste of time to allocate the path before those cases, so reorder the logic so that we check for those cases before allocating the path. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:49 +01:00
Qu Wenruo	1a1a285139	btrfs: remove the unused endio_raid56_workers and btrfs_raid_bio::end_io_work Since we have switched all raid56 workload to submit-and-wait method, there is no use for btrfs_fs_info::endio_raid56_workers workqueue and btrfs_raid_bio::end_io_work. Remove them to save some memory. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:49 +01:00
Qu Wenruo	6bfd0133be	btrfs: raid56: switch scrub path to use a single function This switch involves the following changes: - Make finish_parity_scrub() only to submit the write bios It will no longer call rbio_orig_end_io(), and now it will return error. - Add a new helper, recover_scrub_rbio(), to handle recovery It's just doing extra scrub related checks, and then call recover_sectors(). - Rename raid56_parity_scrub_stripe() to scrub_rbio() - Rename scrub_parity_work() to scrub_rbio_work_locked() To follow the existing naming scheme. - Delete unused functions Including: * finish_rmw() * raid_write_end_io() * raid56_bio_end_io() * __raid_recover_end_io() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:49 +01:00
Qu Wenruo	cb3450b7d7	btrfs: raid56: extract scrub read bio list assembly code into a helper Just like what we did for write/recovery, also extract the read bio assembly code into a helper for scrub. The difference between the three are: - rmw_assemble_read_bios() only submit reads for missing sectors Thus it will skip cached sectors, but will also read sectors which is not covered by any full stripe. (For cache usage) - recover_assemble_read_bios() reads every sector which has not failed - scrub_assemble_read_bios() has extra check for vertical stripes It's mostly the same as rmw_assemble_read_bios(), but will skip sectors which is not covered by a vertical stripe. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:49 +01:00
Qu Wenruo	93723095b5	btrfs: raid56: switch write path to rmw_rbio() This includes the following changes: - Implement new raid_unplug() functions Now we don't need a workqueue to run the plug, as all our work is just queue rmw_rbio_work() call, which can be executed without sleep. - Implement a rmw_rbio_work_locked() helper This is for unlock_stripe(), which is already holding the full stripe lock. - Remove all the old functions This should already shows how complex the old functions are, as we ended up removing the following functions: * rmw_work() * validate_rbio_for_rmw() * raid56_rmw_end_io_work() * raid56_rmw_stripe() * full_stripe_write() * partial_stripe_write() * __raid56_parity_write() * run_plug() * unplug_work() * btrfs_raid_unplug() * rmw_work() * __raid56_parity_recover() * raid_recover_end_io_work() - Unexport rmw_rbio() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:49 +01:00
Qu Wenruo	5eb30ee26f	btrfs: raid56: introduce the main entrance for RMW path The new entrance will be called rmw_rbio(), it will have a streamlined workflow by using submit-and-wait method. Thus there will be no weird jumps between tons of functions, thus way more reader friendly, and will make later expansion easier, as it's now a straight workflow, the timing is way more clear. Unfortunately we can not yet migrate the RMW path to use this new entrance as we still need extra work to address the plug and unlock_stripe() function. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:49 +01:00
Qu Wenruo	6486d21c99	btrfs: raid56: extract rwm write bios assembly into a helper The helper will be later used to refactor the rmw write path. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:49 +01:00
Qu Wenruo	509c27aa2f	btrfs: raid56: extract the rmw bio list build code into a helper The helper will later be used to refactor the whole RMW path. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:49 +01:00
Qu Wenruo	d817ce35d2	btrfs: raid56: switch recovery path to a single function Currently btrfs uses end_io functions to jump between different stages of recovery. For example, we go the following different functions: - raid56_bio_end_io() This handles the read for all the sectors (except the missing device). - __raid_recover_end_io() This does the real work, it's called inside the delayed work function raid_recover_end_io_work(). This one recovery path involves at least 3 different functions, which is a big burden for readers. This patch will change the behavior by: - Introduce a unified recovery entrance, recover_rbio() - Use submit-and-wait method So the workflow is not interrupted by the endio function jump. This doesn't bring performance change, but reduce the burden for reviewers. - Run the main function in the rmw_workers workqueue Now raid56_parity_recover() only needs to setup the work, and queue the work using start_async_work(). Now readers only need to do one function jump (start_async_work()) to find out the main entrance of recovery path. Furthermore, recover_rbio() function can easily be reused by other paths. The old recovery path is still utilized by degraded write path. It will be cleaned up when we have migrated the write path. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:49 +01:00
Qu Wenruo	ec936b0354	btrfs: raid56: extract sector recovery code into a helper This includes extra changes: - The allocation for unmap_array[] and pointers[] Now we allocate them in one go, and free them together. - Remove @err Use errno_to_blk_status(ret) instead. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:48 +01:00
Qu Wenruo	d31968d9b6	btrfs: raid56: extract the recovery bio list build code into a helper This new helper will be also utilized in the incoming refactor of recovery path. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:48 +01:00
Qu Wenruo	30e3c897f4	btrfs: raid56: extract the pq generation code into a helper Currently finish_rmw() will update the P/Q stripes before submitting the writes. It's done behind a for(;;) loop, it's a little congested indent-wise, so extract the code into a helper called generate_pq_vertical(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:48 +01:00
Qu Wenruo	9c5ff9b42c	btrfs: raid56: extract the vertical stripe recovery code into recover_vertical() This refactor includes the following behavior change first: - Don't error out if only P/Q is corrupted The old code will directly error out if only P/Q is corrupted. Although it is an logical error if we go into rebuild path with only P/Q corrupted, there is no need to error out. Just skip the rebuild and return the already good data. Then comes the following refactor which shouldn't cause behavior changes: - Introduce a helper to do vertical stripe recovery This not only reduce one indent level, but also paves the road for later data checksum verification in RMW cycles. - Sort rbio->faila/b before recovery So we don't need to do the same swap every vertical stripe - Replace a BUG_ON() with ASSERT() Or checkpatch won't let me pass. - Mark recovered sectors uptodate after the recover loop - Do the cleanup for pointers unconditionally We only need to initialize @pointers and @unmap_array to NULL, so we can safely free them unconditionally. - Mark the repaired sector uptodate in recover_vertical() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:48 +01:00
David Sterba	ee5f017dcc	btrfs: merge struct extent_page_data to btrfs_bio_ctrl The two structures appear on the same call paths, btrfs_bio_ctrl is embedded in extent_page_data and we pass bio_ctrl to some functions. After merging there are fewer indirections and we have only one control structure. The packing remains same. The btrfs_bio_ctrl was selected as the target structure as the operation is closer to bio processing. Structure layout: struct btrfs_bio_ctrl { struct bio * bio; /* 0 8 / int mirror_num; / 8 4 / enum btrfs_compression_type compress_type; / 12 4 / u32 len_to_stripe_boundary; / 16 4 / u32 len_to_oe_boundary; / 20 4 / btrfs_bio_end_io_t end_io_func; / 24 8 / bool extent_locked; / 32 1 / bool sync_io; / 33 1 / / size: 40, cachelines: 1, members: 8 / / padding: 6 / / last cacheline: 40 bytes */ }; Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:48 +01:00
David Sterba	8ec8519b47	btrfs: switch extent_page_data bit fields to bools The semantics of the two members is a boolean, so change the type accordingly. We have space in extent_page_data due to alignment there's no change in size. Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:48 +01:00
David Sterba	428c8e0310	btrfs: simplify percent calculation helpers, rename div_factor The div_factor* helpers calculate fraction or percentage fraction. The name is a bit confusing, we use it only for percentage calculations and there are two helpers. There's a helper mult_frac that's for general fractions, that tries to be accurate but we multiply and divide by small numbers so we can use the div_u64 helper. Rename the div_factor* helpers and use 1..100 percentage range, also drop the case checking for percentage == 100, it's never hit. The conversions: * div_factor calculates tenths and the numbers need to be adjusted * div_factor_fine is direct replacement Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:48 +01:00
Filipe Manana	20af93d97f	btrfs: update stale comment for nowait direct IO writes If when doing a direct IO write we need to fallback to buffered IO, we this comment at btrfs_direct_write() that says we can't directly fallback to buffered IO if we have a NOWAIT iocb, because we have no support for NOWAIT buffered writes. That is not true anymore, as support for NOWAIT buffered writes was added recently in commit `926078b21d` ("btrfs: enable nowait async buffered writes"). However we still can't fallback to a buffered write in case we have a NOWAIT iocb, because we'll need to flush delalloc and wait for it to complete after doing the buffered write, and that can block for several reasons, the main reason being waiting for IO to complete. So update the comment to mention all that. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:48 +01:00
David Sterba	c30ff698da	btrfs: fix SPDX comment in tree-mod-log.h The header files should use the /* */ comment style, introduced in commit `f3a84ccd28` ("btrfs: move the tree mod log code into its own file"). Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:48 +01:00
Qu Wenruo	a982fc8220	btrfs: extract the inline extent read code into its own function Currently we have inline extent read code behind two levels of indentation, factor them them out into a new function, read_inline_extent(), to make it a little easier to read. Since we're here, also remove @extent_offset and @pg_offset arguments from uncompress_inline() function, as it's not possible to have inline extents at non-inline file offset. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:48 +01:00
Qu Wenruo	280f15cb96	btrfs: remove new_inline argument from btrfs_extent_item_to_extent_map() The argument @new_inline changes the following members of extent_map: - em->compress_type - EXTENT_FLAG_COMPRESSED of em->flags However neither members makes a difference for inline extents: - Inline extent read never use above em members As inside btrfs_get_extent() we directly use the file extent item to do the read. - Inline extents are never to be split Thus code really needs em->compress_type or that flag will never be executed on inlined extents. (btrfs_drop_extent_cache() would be one example) - Fiemap no longer relies on extent maps Recent fiemap optimization makes fiemap to search subvolume tree directly, without using any extent map at all. Thus those members make no difference for inline extents any more. Furthermore such exception without much explanation is really a source of confusion. Thus this patch will completely remove the argument, and always set the involved members, unifying the behavior. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:48 +01:00
Qu Wenruo	a196a8944f	btrfs: do not reset extent map members for inline extents read Currently for inline extents read inside btrfs_get_extent(), we will reset several extent map members: - em->start Reset to extent_start, which is completely unnecessary. The extent_start and em->start should have already be zero, ensured by tree-checker already. - em->len Reset the round_up(copy_size, fs_info->sectorsize), which is again unnecessary. - em->orig_block_len Reset to em->len (sectorsize), while it is originally unset from btrfs_extent_item_to_extent_map(). This makes no difference, as all extent map handling paths will ignore the orig_block_len if they found it's an inlined extent. Such inline extent orig_block_len ignoring examples can be found in btrfs_drop_extent_cache(). - em->orig_start Reset to em->start (0), while it is originally set to EXTENT_MAP_HOLE. This makes no difference either, as all extent map handling paths will ignore the em->orig_start if they found it's an inline extent. Thus all these em members resetting are unnecessary. Replace them with ASSERT()s checking the only two members (block_start and length) that make sense. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:47 +01:00
Qu Wenruo	affc542433	btrfs: make inline extent read calculation much simpler Currently we calculate inline extent read in a way that inline extent can start at non-zero offset. This is consistent with the inode selftests, which puts an inline extent at file offset 5. Meanwhile the inline extent creation code will only create inline extent at file offset 0. Furthermore with the introduction of tree-checker on file extents, we are actively rejecting inline extent which starts at non-zero file offset. And so far we haven't yet seen any report of rejected inline extents at non-zero file offset. This all means, the extra calculation to support inline extents at non-zero file offset is mostly paper weight, and damaging the readability of the code. Thus this patch will: - Add extra ASSERT()s to make sure involved file offset are all 0 - Remove @extent_offset calculation - Simplify the involved code As several variables are now single-use, no need to declare them as a variable anymore. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:47 +01:00
Qu Wenruo	d52a136525	btrfs: selftests: remove impossible inline extent at non-zero file offset In our inode-tests.c, we create an inline offset at file offset 5, which is no longer possible since the introduction of tree-checker. Thus I don't think we should spend time maintaining some corner cases which are already ruled out by tree-checker. So this patch will: - Change the inline extent to start at file offset 0 Also change its length to 6 to cover the original length - Add an extra ASSERT() for btrfs_add_extent_mapping() This is to make sure tree-checker is working correctly. - Update the inode selftest Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:47 +01:00
Josef Bacik	aa5d3003dd	btrfs: move orphan prototypes into orphan.h Move these out of ctree.h into orphan.h to cut down on code in ctree.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:47 +01:00
Josef Bacik	7f0add250f	btrfs: move super_block specific helpers into super.h This will make syncing fs.h to user space a little easier if we can pull the super block specific helpers out of fs.h and put them in super.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:47 +01:00
Josef Bacik	c03b22076b	btrfs: move super prototypes into super.h Move these out of ctree.h into super.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:47 +01:00
Josef Bacik	6a6b4daf92	btrfs: move CONFIG_BTRFS_FS_RUN_SANITY_TESTS checks to fs.h We already have a few of these in fs.h, move the remaining checks out of ctree.h into fs.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:47 +01:00
Josef Bacik	5c11adcc38	btrfs: move verity prototypes into verity.h Move these out of ctree.h into verity.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:47 +01:00
Josef Bacik	77407dc032	btrfs: move dev-replace prototypes into dev-replace.h We already have a dev-replace.h, simply move these prototypes and helpers into dev-replace.h where they belong. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:47 +01:00
Josef Bacik	2fc6822c99	btrfs: move scrub prototypes into scrub.h Move these out of ctree.h into scrub.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:47 +01:00
Josef Bacik	677074792a	btrfs: move relocation prototypes into relocation.h Move these out of ctree.h into relocation.h to cut down on code in ctree.h Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:47 +01:00
Josef Bacik	33cf97a7b6	btrfs: move acl prototypes into acl.h Move these out of ctree.h into acl.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:46 +01:00
Josef Bacik	cc68414c61	btrfs: move the snapshot drop related prototypes to extent-tree.h These belong in extent-tree.h, they were missed because they were not grouped with the other extent-tree.c prototypes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:46 +01:00
Josef Bacik	b538a271ae	btrfs: move the 32bit warn defines into messages.h The code for these functions are in messages.c, move the defines and prototypes to messages.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:46 +01:00
Josef Bacik	af142b6f44	btrfs: move file prototypes to file.h Move these out of ctree.h into file.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:46 +01:00
Josef Bacik	7572dec8f5	btrfs: move ioctl prototypes into ioctl.h Move these out of ctree.h into ioctl.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:46 +01:00
Josef Bacik	c7a03b524d	btrfs: move uuid tree prototypes to uuid-tree.h Move these out of ctree.h into uuid-tree.h to cut down on the code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:46 +01:00
Josef Bacik	7c8ede1628	btrfs: move file-item prototypes into their own header Move these prototypes out of ctree.h and into file-item.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:46 +01:00
Josef Bacik	f2b39277b8	btrfs: move dir-item prototypes into dir-item.h Move these prototypes out of ctree.h and into their own header file. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:46 +01:00
Josef Bacik	59b818e064	btrfs: move defrag related prototypes to their own header Now that the defrag code is all in one file, create a defrag.h and move all the defrag related prototypes and helper out of ctree.h and into defrag.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:46 +01:00
Josef Bacik	a6a01ca61f	btrfs: move the file defrag code into defrag.c This is the other big portion of defrag code that has existed in ioctl.c. Move it to its new home in defrag.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:45 +01:00
Josef Bacik	6e3df18ba7	btrfs: move the auto defrag code to defrag.c This currently exists in file.c, move it to the more natural location in defrag.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ reformat comments ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:45 +01:00
Josef Bacik	778dd695dd	btrfs: rename tree-defrag.c to defrag.c This currently has only one helper in it, and it's for tree based defrag. We have the various defrag code in 3 different places, so rename this to defrag.c. Followup patches will move the code into this new file. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:45 +01:00
Josef Bacik	2885fd6320	btrfs: move inode prototypes to btrfs_inode.h I initially wanted to make a new header file for this, but these prototypes do naturally fit into btrfs_inode.h. If we want to extract vfs from pure btrfs code in the future we may need to split this up, but btrfs_inode embeds the vfs_inode, so it makes sense to put the prototypes in this header for now. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:45 +01:00
Josef Bacik	083bd7e54e	btrfs: move the printk and assert helpers to messages.c These helpers are core to btrfs, and in order to more easily sync various parts of the btrfs kernel code into btrfs-progs we need to be able to carry these helpers with us. However we want to have our own implementation for the helpers themselves, currently they're implemented in different files that we want to sync inside of btrfs-progs itself. Move these into their own C file, this will allow us to contain our overrides in btrfs-progs in it's own file without messing with the rest of the codebase. In copying things over I fixed up a few whitespace errors that already existed. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:45 +01:00
Josef Bacik	5034388342	btrfs: add blk_types.h include to compression.h When moving the printk messages into their own file I got a compiler error because the includes grabbed compression.h, but nothing pulled in the blk_types.h dependency that compression.h has because it uses blkstatus_t. Add blk_types.h to compression.h so that this sort of thing doesn't happen in the future. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:45 +01:00
Josef Bacik	3683fbbc23	btrfs: add dependencies to fs.h and block-rsv.h There's several structures that are embedded inside of fs_info.h, so if we don't have all the proper includes when we include fs.h we'll get a variety of compile errors. I fixed this by adding a temporary c file that just had #include "fs.h" and then added include files until the compiler stopped complaining. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:45 +01:00
Josef Bacik	b31bed170d	btrfs: move btrfs_chunk_item_size out of ctree.h This is used by the volumes code and the tree checker code. We want to maintain inline however, so simply move it to volumes.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:45 +01:00
Josef Bacik	cb9a10a650	btrfs: convert discard stat defs to enum Do away with the defines and use an enum as it's cleaner. Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:45 +01:00
David Sterba	43dd529abe	btrfs: update function comments Update, reformat or reword function comments. This also removes the kdoc marker so we don't get reports when the function name is missing. Changes made: - remove kdoc markers - reformat the brief description to be a proper sentence - reword to imperative voice - align parameter list - fix typos Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:45 +01:00
Josef Bacik	1751850fbd	btrfs: remove unused btrfs_cond_migrate_bytes The last user of this was removed in `7f9fe61440` ("btrfs: improve global reserve stealing logic"), drop this code as it's no longer called by anybody. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:45 +01:00
Josef Bacik	911bd75aca	btrfs: remove unused function prototypes I wrote the following coccinelle script to find function declarations that didn't have the corresponding code for them @funcproto@ identifier func; type T; position p0; @@ T func@p0(...); @funccode@ identifier funcproto.func; position p1; @@ func@p1(...) { ... } @script:python depends on !funccode@ p0 << funcproto.p0; @@ print("Proto with no function at %s:%s" % (p0[0].file, p0[0].line)) and ran it against btrfs, which identified the 4 function prototypes I've removed in this patch. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:44 +01:00
Josef Bacik	45c40c8f95	btrfs: move root tree prototypes to their own header Move all the root-tree.c prototypes to root-tree.h, and then update all the necessary files to include the new header. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:44 +01:00
Josef Bacik	6d2049a2f3	btrfs: delete unused function prototypes in ctree.h This batch of prototypes no longer have code associated with them, so remove them. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:44 +01:00
Josef Bacik	2839c2c142	btrfs: move delalloc space related prototypes to delalloc-space.h These exist in delalloc-space.c, move them from ctree.h into delalloc-space.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:44 +01:00
Josef Bacik	a0231804af	btrfs: move extent-tree helpers into their own header file Move all the extent tree related prototypes to extent-tree.h out of ctree.h, and then go include it everywhere needed so everything compiles. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:44 +01:00
Josef Bacik	e2f13b343c	btrfs: move btrfs_account_ro_block_groups_free_space into space-info.c This was prototyped in ctree.h and the code existed in extent-tree.c, but it's space-info related so move it into space-info.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:44 +01:00
Josef Bacik	8483d40242	btrfs: remove extra space info prototypes in ctree.h These are defined already in space-info.h, remove them from ctree.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:44 +01:00
Josef Bacik	13d925c1c2	btrfs: minor whitespace in ctree.h We've accumulated some whitespace problems in ctree.h, clean these up. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:44 +01:00
Josef Bacik	eb33a4d65b	btrfs: move the lockdep helpers into locking.h These more naturally fit in with the locking related code, and they're all defines so they can easily go anywhere, move them out of ctree.h into locking.h Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:44 +01:00
Josef Bacik	a56159d408	btrfs: move btrfs_fs_info declarations into fs.h Now that we have a lot of the fs_info related helpers and stuff isolated, copy these over to fs.h out of ctree.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:44 +01:00
Omar Sandoval	94a48aef49	btrfs: extend btrfs_dir_item type to store encryption status For directories with encrypted files/filenames, we need to store a flag indicating this fact. There's no room in other fields, so we'll need to borrow a bit from dir_type. Since it's now a combination of type and flags, we rename it to dir_flags to reflect its new usage. The new flag, FT_ENCRYPTED, indicates a directory containing encrypted data, which is orthogonal to file type; therefore, add the new flag, and make conversion from directory type to file type strip the flag. As the file types almost never change we can afford to use the bits. Actual usage will be guarded behind an incompat bit, this patch only adds the support for later use by fscrypt. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:43 +01:00
Sweet Tea Dorminy	6db7531882	btrfs: use struct fscrypt_str instead of struct qstr While struct qstr is more natural without fscrypt, since it's provided by dentries, struct fscrypt_str is provided by the fscrypt handlers processing dentries, and is thus more natural in the fscrypt world. Replace all of the struct qstr uses with struct fscrypt_str. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:43 +01:00
Sweet Tea Dorminy	ab3c5c18e8	btrfs: setup qstr from dentrys using fscrypt helper Most places where we get a struct qstr, we are doing so from a dentry. With fscrypt, the dentry's name may be encrypted on-disk, so fscrypt provides a helper to convert a dentry name to the appropriate disk name if necessary. Convert each of the dentry name accesses to use fscrypt_setup_filename(), then convert the resulting fscrypt_name back to an unencrypted qstr. This does not work for nokey names, but the specific locations that could spawn nokey names are noted. At present, since there are no encrypted directories, nothing goes down the filename encryption paths. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:43 +01:00
Sweet Tea Dorminy	e43eec81c5	btrfs: use struct qstr instead of name and namelen pairs Many functions throughout btrfs take name buffer and name length arguments. Most of these functions at the highest level are usually called with these arguments extracted from a supplied dentry's name. But the entire name can be passed instead, making each function a little more elegant. Each function whose arguments are currently the name and length extracted from a dentry is herein converted to instead take a pointer to the name in the dentry. The couple of calls to these calls without a struct dentry are converted to create an appropriate qstr to pass in. Additionally, every function which is only called with a name/len extracted directly from a qstr is also converted. This change has positive effect on stack consumption, frame of many functions is reduced but this will be used in the future for fscrypt related structures. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:43 +01:00
Anand Jain	82c0efd3cd	btrfs: merge module cleanup sequence to one helper The module exit function exit_btrfs_fs() is duplicating a section of code in init_btrfs_fs(). Add a helper to remove the duplicated code. Due to the init/exit section requirements the function must be inline and not a plain static as it could cause section mismatch. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:43 +01:00
David Sterba	02bc392798	btrfs: sink gfp_t parameter to alloc_scrub_sector All callers pas GFP_KERNEL as parameter so we can use it directly in alloc_scrub_sector. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:43 +01:00
David Sterba	fe10158c75	btrfs: switch GFP_NOFS to GFP_KERNEL in scrub_setup_recheck_block There's only one caller that calls scrub_setup_recheck_block in the memalloc_nofs_save/_restore protection so it's effectively already GFP_NOFS and it's safe to use GFP_KERNEL. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:43 +01:00
David Sterba	e2896e7910	btrfs: sink gfp_t parameter to btrfs_qgroup_trace_extent All callers pass GFP_NOFS, we can drop the parameter and use it directly. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:43 +01:00
David Sterba	d68194b238	btrfs: sink gfp_t parameter to btrfs_backref_iter_alloc There's only one caller that passes GFP_NOFS, we can drop the parameter an use the flags directly. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:43 +01:00
Josef Bacik	e9c83077d2	btrfs: remove temporary btrfs_map_token declaration in ctree.h This was added while I was moving this code to its new home, it can be removed now. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:43 +01:00
Josef Bacik	07e81dc944	btrfs: move accessor helpers into accessors.h This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:42 +01:00
Josef Bacik	ad1ac5012c	btrfs: move btrfs_map_token to accessors This is specific to the item-accessor code, move it out of ctree.h into accessor.h/.c and then update the users to include the new header file. This un-inlines btrfs_init_map_token, however this is only called once per function so it's not critical to be inlined. This also saves 904 bytes of code on a release build. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:42 +01:00
Josef Bacik	818fe33aed	btrfs: rename struct-funcs.c to accessors.c Rename struct-funcs.c to accessors.c so we can move the item accessors out of ctree.h. accessors.c is a better description of the code that is contained in these files. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:42 +01:00
Josef Bacik	d83eb482b7	btrfs: move the compat/incompat flag masks to fs.h This is fs wide information, move it out of ctree.h into fs.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:42 +01:00
Josef Bacik	55e5cfd36d	btrfs: remove fs_info::pending_changes and related code Now that we're not using this code anywhere we can remove it as well as the member from fs_info. We don't have any mount options or on/off features that would utilize the pending infrastructure, the last one was inode_cache. There was a patchset [1] to enable some features from sysfs that would break things if it would be set immediately. In case we'll need that kind of logic again the patch can be reverted, but for the current use it can be replaced by the single state bit to do the commit. [1] https://lore.kernel.org/linux-btrfs/1422609654-19519-1-git-send-email-quwenruo@cn.fujitsu.com/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:42 +01:00
Josef Bacik	c52cc7b7ac	btrfs: add a BTRFS_FS_NEED_TRANS_COMMIT flag Currently we are only using fs_info->pending_changes to indicate that we need a transaction commit. The original users for this were removed years ago and we don't have more usage in sight, so this is the only remaining reason to have this field. Add a flag so we can remove this code. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:42 +01:00
Josef Bacik	7966a6b595	btrfs: move fs_info::flags enum to fs.h These definitions are fs wide, take them out of ctree.h and put them in fs.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:42 +01:00
Josef Bacik	fc97a410bd	btrfs: move mount option definitions to fs.h These are fs wide definitions and helpers, move them out of ctree.h and into fs.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:42 +01:00
Josef Bacik	0d3a9cf8c3	btrfs: convert incompat and compat flag test helpers to macros These helpers use functions not defined in fs.h, they're simply accessors of the super block in fs_info, convert them to macros so that we don't have a weird dependency between fs.h and accessors.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:42 +01:00
Josef Bacik	ec8eb376e2	btrfs: move BTRFS_FS_STATE* definitions and helpers to fs.h We're going to use fs.h to hold fs wide related helpers and definitions, move the FS_STATE enum and related helpers to fs.h, and then update all files that need these definitions to include fs.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:42 +01:00
Josef Bacik	bbde07a40a	btrfs: push printk index code into their respective helpers The printk index work can be pushed into the printk helpers themselves, this allows us to further sanitize messages.h, removing the last include in the header itself. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:41 +01:00
Josef Bacik	9b569ea0be	btrfs: move the printk helpers out of ctree.h We have a bunch of printk helpers that are in ctree.h. These have nothing to do with ctree.c, so move them into their own header. Subsequent patches will cleanup the printk helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:41 +01:00
Josef Bacik	e118578a8d	btrfs: move assert helpers out of ctree.h These call functions that aren't defined in, or will be moved out of, ctree.h Move them to super.c where the other assert/error message code is defined. Drop the __noreturn attribute for btrfs_assertfail as objtool does not like it and fails with warnings like fs/btrfs/dir-item.o: warning: objtool: .text.unlikely: unexpected end of section fs/btrfs/xattr.o: warning: objtool: btrfs_setxattr() falls through to next function btrfs_setxattr_trans.cold() fs/btrfs/xattr.o: warning: objtool: .text.unlikely: unexpected end of section Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:41 +01:00
Josef Bacik	c7f13d428e	btrfs: move fs wide helpers out of ctree.h We have several fs wide related helpers in ctree.h. The bulk of these are the incompat flag test helpers, but there are things such as btrfs_fs_closing() and the read only helpers that also aren't directly related to the ctree code. Move these into a fs.h header, which will serve as the location for file system wide related helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:41 +01:00
Wang Yugui	875c627c5f	btrfs: send add define for v2 buffer size Add a define for the data buffer size (though the maximum size is not limited by it) BTRFS_SEND_BUF_SIZE_V2 so it's more visible. Signed-off-by: Wang Yugui <wangyugui@e16-tech.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:41 +01:00
David Sterba	b307f06d37	btrfs: simplify generation check in btrfs_get_dentry Callers that pass non-zero generation always want to perform the generation check, we can simply encode that in one parameter and drop check_generation. Add function documentation. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:41 +01:00
David Sterba	63a7cb1307	btrfs: auto enable discard=async when possible There's a request to automatically enable async discard for capable devices. We can do that, the async mode is designed to wait for larger freed extents and is not intrusive, with limits to iops, kbps or latency. The status and tunables will be exported in /sys/fs/btrfs/FSID/discard . The automatic selection is done if there's at least one discard capable device in the filesystem (not capable devices are skipped). Mounting with any other discard option will honor that option, notably mounting with nodiscard will keep it disabled. Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:41 +01:00
David Sterba	467761f904	btrfs: sysfs: convert remaining scnprintf to sysfs_emit The sysfs_emit is the safe API for writing to the sysfs files, previously converted from scnprintf, there's one left to do in btrfs_read_policy_show. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:41 +01:00
Josef Bacik	5a75034e71	btrfs: do not panic if we can't allocate a prealloc extent state We sometimes have to allocate new extent states when clearing or setting new bits in an extent io tree. Generally we preallocate this before taking the tree spin lock, but we can use this preallocated extent state sometimes and then need to try to do a GFP_ATOMIC allocation under the lock. Unfortunately sometimes this fails, and then we hit the BUG_ON() and bring the box down. This happens roughly 20 times a week in our fleet. However the vast majority of callers use GFP_NOFS, which means that if this GFP_ATOMIC allocation fails, we could simply drop the spin lock, go back and allocate a new extent state with our given gfp mask, and begin again from where we left off. For the remaining callers that do not use GFP_NOFS, they are generally using GFP_NOWAIT, which still allows for some reclaim. So allow these allocations to attempt to happen outside of the spin lock so we don't need to rely on GFP_ATOMIC allocations. This in essence creates an infinite loop for anything that isn't GFP_NOFS. To address this we may want to migrate to using mempools for extent states so that we will always have emergency reserves in order to make our allocations. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:41 +01:00
Josef Bacik	da2a071b6f	btrfs: remove unused unlock_extent_atomic As of "btrfs: do not use GFP_ATOMIC in the read endio" we no longer have any users of unlock_extent_atomic, remove it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:41 +01:00
Josef Bacik	48acc47d78	btrfs: do not use GFP_ATOMIC in the read endio We have done read endio in an async thread for a very, very long time, which makes the use of GFP_ATOMIC and unlock_extent_atomic() unneeded in our read endio path. We've noticed under heavy memory pressure in our fleet that we can fail these allocations, and then often trip a BUG_ON(!allocation), which isn't an ideal outcome. Begin to address this by simply not using GFP_ATOMIC, which will allow us to do things like actually allocate a extent state when doing set_extent_bits(UPTODATE) in the endio handler. End io handlers are not called in atomic context, besides we have been allocating failrec with GFP_NOFS so we'd notice there's a problem. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:40 +01:00
Qu Wenruo	7248e0cebb	btrfs: skip update of block group item if used bytes are the same [BACKGROUND] When committing a transaction, we will update block group items for all dirty block groups. But in fact, dirty block groups don't always need to update their block group items. It's pretty common to have a metadata block group which experienced several COW operations, but still have the same amount of used bytes. In that case, we may unnecessarily COW a tree block doing nothing. [ENHANCEMENT] This patch will introduce btrfs_block_group::commit_used member to remember the last used bytes, and use that new member to skip unnecessary block group item update. This would be more common for large filesystems, where metadata block group can be as large as 1GiB, containing at most 64K metadata items. In that case, if COW added and then deleted one metadata item near the end of the block group, then it's completely possible we don't need to touch the block group item at all. [BENCHMARK] The change itself can have quite a high chance (20~80%) to skip block group item updates in lot of workloads. As a result, it would result shorter time spent on btrfs_write_dirty_block_groups(), and overall reduce the execution time of the critical section of btrfs_commit_transaction(). Here comes a fio command, which will do random writes in 4K block size, causing a very heavy metadata updates. fio --filename=$mnt/file --size=512M --rw=randwrite --direct=1 --bs=4k \ --ioengine=libaio --iodepth=64 --runtime=300 --numjobs=4 \ --name=random_write --fallocate=none --time_based --fsync_on_close=1 The file size (512M) and number of threads (4) means 2GiB file size in total, but during the full 300s run time, my dedicated SATA SSD is able to write around 20~25GiB, which is over 10 times the file size. Thus after we fill the initial 2G, we should not cause much block group item updates. Please note, the fio numbers by themselves don't have much change, but if we look deeper, there is some reduced execution time, especially for the critical section of btrfs_commit_transaction(). I added extra trace_printk() to measure the following per-transaction execution time: - Critical section of btrfs_commit_transaction() By re-using the existing update_commit_stats() function, which has already calculated the interval correctly. - The while() loop for btrfs_write_dirty_block_groups() Although this includes the execution time of btrfs_run_delayed_refs(), it should still be representative overall. Both result involves transid 7~30, the same amount of transaction committed. The result looks like this: \| Before \| After \| Diff ----------------------+-------------------+----------------+-------- Transaction interval \| 229247198.5 \| 215016933.6 \| -6.2% Block group interval \| 23133.33333 \| 18970.83333 \| -18.0% The change in block group item updates is more obvious, as skipped block group item updates also mean less delayed refs. And the overall execution time for that block group update loop is pretty small, thus we can assume the extent tree is already mostly cached. If we can skip an uncached tree block, it would cause more obvious change. Unfortunately the overall reduction in commit transaction critical section is much smaller, as the block group item updates loop is not really the major part, at least not for the above fio script. But still we have a observable reduction in the critical section. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:40 +01:00
David Sterba	cc37ea6192	btrfs: convert __TRANS_* defines to enum bits The base transaction bits can be defined as bits in a contiguous sequence, although right now there's a hole from bit 1 to 8. The bits are used for btrfs_trans_handle::type, and there's another set of TRANS_STATE_* defines that are for btrfs_transaction::state. They are mutually exclusive though the hole in the sequence looks like was made for the states. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:40 +01:00
David Sterba	e0a8b9a747	btrfs: convert QGROUP_* defines to enum bits The defines/enums are used only for tracepoints and are not part of the on-disk format. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:40 +01:00
David Sterba	d3b4d0fd55	btrfs: convert EXTENT_* bits to enums Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:40 +01:00
David Sterba	fd8d2951f4	btrfs: convert extent_io page op defines to enum bits Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:40 +01:00
David Sterba	c7321b76df	btrfs: convert BTRFS_ILOCK-* defines to enum bit Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:40 +01:00
David Sterba	d549ff7bdb	btrfs: add helper for bit enumeration Define helper macro that can be used in enum {} to utilize the automatic increment to define all bits without directly defining the values or using additional linear bits. 1. capture the sequence value, N 2. use the value to define the given enum with N-th bit set 3. reset the sequence back to N Use for enums that do not require fixed values for symbolic names (like for on-disk structures): enum { ENUM_BIT(FIRST), ENUM_BIT(SECOND), ENUM_BIT(THIRD) }; Where the values would be 0x1, 0x2 and 0x4. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:40 +01:00
Qu Wenruo	5565b8e0ad	btrfs: make module init/exit match their sequence [BACKGROUND] In theory init_btrfs_fs() and exit_btrfs_fs() should match their sequence, thus normally they should look like this: init_btrfs_fs() \| exit_btrfs_fs() ----------------------+------------------------ init_A(); \| init_B(); \| init_C(); \| \| exit_C(); \| exit_B(); \| exit_A(); So is for the error path of init_btrfs_fs(). But it's not the case, some exit functions don't match their init functions sequence in init_btrfs_fs(). Furthermore in init_btrfs_fs(), we need to have a new error label for each new init function we added. This is not really expandable, especially recently we may add several new functions to init_btrfs_fs(). [ENHANCEMENT] The patch will introduce the following things to enhance the situation: - struct init_sequence Just a wrapper of init and exit function pointers. The init function must use int type as return value, thus some init functions need to be updated to return 0. The exit function can be NULL, as there are some init sequence just outputting a message. - struct mod_init_seq[] array This is a const array, recording all the initialization we need to do in init_btrfs_fs(), and the order follows the old init_btrfs_fs(). - bool mod_init_result[] array This is a bool array, recording if we have initialized one entry in mod_init_seq[]. The reason to split mod_init_seq[] and mod_init_result[] is to avoid section mismatch in reference. All init function are in .init.text, but if mod_init_seq[] records the @initialized member it can no longer be const, thus will be put into .data section, and cause modpost warning. For init_btrfs_fs() we just call all init functions in their order in mod_init_seq[] array, and after each call, setting corresponding mod_init_result[] to true. For exit_btrfs_fs() and error handling path of init_btrfs_fs(), we just iterate mod_init_seq[] in reverse order, and skip all uninitialized entry. With this patch, init_btrfs_fs()/exit_btrfs_fs() will be much easier to expand and will always follow the strict order. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:40 +01:00
Filipe Manana	33cff222fa	btrfs: remove gfp_t flag from btrfs_tree_mod_log_insert_key() All callers of btrfs_tree_mod_log_insert_key() are now passing a GFP_NOFS flag to it, so remove the flag from it and from alloc_tree_mod_elem() and use it directly within alloc_tree_mod_elem(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:40 +01:00
Filipe Manana	879b222198	btrfs: switch GFP_ATOMIC to GFP_NOFS when fixing up low keys When fixing up the first key of each node above the current level, at fixup_low_keys(), we are doing a GFP_ATOMIC allocation for inserting an operation record for the tree mod log. However we can do just fine with GFP_NOFS nowadays. The need for GFP_ATOMIC was for the old days when we had custom locks with spinning behaviour for extent buffers and we were in spinning mode while at fixup_low_keys(). Now we use rw semaphores for extent buffer locks, so we can safely use GFP_NOFS. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:40 +01:00
Boris Burkov	81531225e5	btrfs: re-check reclaim condition in reclaim worker I have observed the following case play out and lead to unnecessary relocations: 1. write a file across multiple block groups 2. delete the file 3. several block groups fall below the reclaim threshold 4. reclaim the first, moving extents into the others 5. reclaim the others which are now actually very full, leading to poor reclaim behavior with lots of writing, allocating new block groups, etc. I believe the risk of missing some reasonable reclaims is worth it when traded off against the savings of avoiding overfull reclaims. Going forward, it could be interesting to make the check more advanced (zoned aware, fragmentation aware, etc...) so that it can be a really strong signal both at extent delete and reclaim time. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:39 +01:00
Boris Burkov	cc4804bfd6	btrfs: skip reclaim if block_group is empty As we delete extents from a block group, at some deletion we cross below the reclaim threshold. It is possible we are still in the middle of deleting more extents and might soon hit 0. If the block group is empty by the time the reclaim worker runs, we will still relocate it. This works just fine, as relocating an empty block group ultimately results in properly deleting it. However, we have more direct ways of removing empty block groups in the cleaner thread. Those are either async discard or the unused_bgs list. In fact, when we decide whether to relocate a block group during extent deletion, we do check for emptiness and prefer the discard/unused_bgs mechanisms when possible. Not using relocation for this case reduces some modest overhead from empty bg relocation: - extra transactions - extra metadata use/churn for creating relocation metadata - trying to read the extent tree to look for extents (and in this case finding none) Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:39 +01:00
Filipe Manana	6976201f18	btrfs: avoid unnecessary resolution of indirect backrefs during fiemap During fiemap, when determining if a data extent is shared or not, if we don't find the extent is directly shared, then we need to determine if it's shared through subtrees. For that we need to resolve the indirect reference we found in order to figure out the path in the inode's fs tree, which is a path starting at the fs tree's root node and going down to the leaf that contains the file extent item that points to the data extent. We then proceed to determine if any extent buffer in that path is shared with other trees or not. However when the generation of the data extent is more recent than the last generation used to snapshot the root, we don't need to determine the path, since the data extent can not be shared through snapshots. For this case we currently still determine the leaf of that path (at find_parent_nodes(), but then stop determining the other nodes in the path (at btrfs_is_data_extent_shared()) as it's pointless. So do the check of the data extent's generation earlier, at find_parent_nodes(), before trying to resolve the indirect reference to determine the leaf in the path. This saves us from doing one expensive b+tree search in the fs tree of our target inode, as well as other minor work. The following test was run on a non-debug kernel (Debian's default kernel config): $ cat test-fiemap.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi umount $DEV &> /dev/null mkfs.btrfs -f $DEV # Use compression to quickly create files with a lot of extents # (each with a size of 128K). mount -o compress=lzo $DEV $MNT # 40G gives 327680 extents, each with a size of 128K. xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar # Add some more files to increase the size of the fs and extent # trees (in the real world there's a lot of files and extents # from other files). xfs_io -f -c "pwrite -S 0xcd -b 1M 0 20G" $MNT/file1 xfs_io -f -c "pwrite -S 0xef -b 1M 0 20G" $MNT/file2 xfs_io -f -c "pwrite -S 0x73 -b 1M 0 20G" $MNT/file3 umount $MNT mount -o compress=lzo $DEV $MNT start=$(date +%s%N) filefrag $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds (metadata not cached)" echo start=$(date +%s%N) filefrag $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds (metadata cached)" umount $MNT Before applying this patch: (...) /mnt/sdi/foobar: 327680 extents found fiemap took 1285 milliseconds (metadata not cached) /mnt/sdi/foobar: 327680 extents found fiemap took 742 milliseconds (metadata cached) After applying this patch: (...) /mnt/sdi/foobar: 327680 extents found fiemap took 689 milliseconds (metadata not cached) /mnt/sdi/foobar: 327680 extents found fiemap took 393 milliseconds (metadata cached) That's a -46.4% total reduction for the metadata not cached case, and a -47.0% reduction for the cached metadata case. The test is somewhat limited in the sense the gains may be higher in practice, because in the test the filesystem is small, so we have small fs and extent trees, plus there's no concurrent access to the trees as well, therefore no lock contention there. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:39 +01:00
Filipe Manana	877c14767f	btrfs: avoid duplicated resolution of indirect backrefs during fiemap During fiemap, when determining if a data extent is shared or not, if we don't find the extent is directly shared, then we need to determine if it's shared through subtrees. For that we need to resolve the indirect reference we found in order to figure out the path in the inode's fs tree, which is a path starting at the fs tree's root node and going down to the leaf that contains the file extent item that points to the data extent. We then proceed to determine if any extent buffer in that path is shared with other trees or not. Currently whenever we find the data extent that a file extent item points to is not directly shared, we always resolve the path in the fs tree, and then check if any extent buffer in the path is shared. This is a lot of work and when we have file extent items that belong to the same leaf, we have the same path, so we only need to calculate it once. This change does that, it keeps track of the current and previous leaf, and when we find that a data extent is not directly shared, we try to compute the fs tree path only once and then use it for every other file extent item in the same leaf, using the existing cached path result for the leaf as long as the cache results are valid. This saves us from doing expensive b+tree searches in the fs tree of our target inode, as well as other minor work. The following test was run on a non-debug kernel (Debian's default kernel config): $ cat test-with-snapshots.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi umount $DEV &> /dev/null mkfs.btrfs -f $DEV # Use compression to quickly create files with a lot of extents # (each with a size of 128K). mount -o compress=lzo $DEV $MNT # 40G gives 327680 extents, each with a size of 128K. xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar # Add some more files to increase the size of the fs and extent # trees (in the real world there's a lot of files and extents # from other files). xfs_io -f -c "pwrite -S 0xcd -b 1M 0 20G" $MNT/file1 xfs_io -f -c "pwrite -S 0xef -b 1M 0 20G" $MNT/file2 xfs_io -f -c "pwrite -S 0x73 -b 1M 0 20G" $MNT/file3 # Create a snapshot so all the extents become indirectly shared # through subtrees, with a generation less than or equals to the # generation used to create the snapshot. btrfs subvolume snapshot -r $MNT $MNT/snap1 umount $MNT mount -o compress=lzo $DEV $MNT start=$(date +%s%N) filefrag $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds (metadata not cached)" echo start=$(date +%s%N) filefrag $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds (metadata cached)" umount $MNT Result before applying this patch: (...) /mnt/sdi/foobar: 327680 extents found fiemap took 1204 milliseconds (metadata not cached) /mnt/sdi/foobar: 327680 extents found fiemap took 729 milliseconds (metadata cached) Result after applying this patch: (...) /mnt/sdi/foobar: 327680 extents found fiemap took 732 milliseconds (metadata not cached) /mnt/sdi/foobar: 327680 extents found fiemap took 421 milliseconds (metadata cached) That's a -46.1% total reduction for the metadata not cached case, and a -42.2% reduction for the cached metadata case. The test is somewhat limited in the sense the gains may be higher in practice, because in the test the filesystem is small, so we have small fs and extent trees, plus there's no concurrent access to the trees as well, therefore no lock contention there. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:39 +01:00
Filipe Manana	583f4ac562	btrfs: move up backref sharedness cache store and lookup functions Move the static functions to lookup and store sharedness check of an extent buffer to a location above find_all_parents(), because in the next patch the lookup function will be used by find_all_parents(). The store function is also moved just because it's the counter part to the lookup function and it's best to have their definitions close together. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:39 +01:00
Filipe Manana	73e339e6ab	btrfs: cache sharedness of the last few data extents during fiemap During fiemap we process all the file extent items of an inode, by their file offset order (left to right b+tree order), and then check if the data extent they point at is shared or not. Until now we didn't cache those results, we only did it for b+tree nodes/leaves since for each unique b+tree path we have access to hundreds of file extent items. However, it is also common to repeat checking the sharedness of a particular data extent in a very short time window, and the cases that lead to that are the following: 1) COW writes. If have a file extent item like this: [ bytenr X, offset = 0, num_bytes = 512K ] file offset 0 512K Then a 4K write into file offset 64K happens, we end up with the following file extent item layout: [ bytenr X, offset = 0, num_bytes = 64K ] file offset 0 64K [ bytenr Y, offset = 0, num_bytes = 4K ] file offset 64K 68K [ bytenr X, offset = 68K, num_bytes = 444K ] file offset 68K 512K So during fiemap we well check for the sharedness of the data extent with bytenr X twice. Typically for COW writes and for at least moderately updated files, we end up with many file extent items that point to different sections of the same data extent. 2) Writing into a NOCOW file after a snapshot is taken. This happens if the target extent was created in a generation older than the generation where the last snapshot for the root (the tree the inode belongs to) was made. This leads to a scenario like the previous one. 3) Writing into sections of a preallocated extent. For example if a file has the following layout: [ bytenr X, offset = 0, num_bytes = 1M, type = prealloc ] 0 1M After doing a 4K write into file offset 0 and another 4K write into offset 512K, we get the following layout: [ bytenr X, offset = 0, num_bytes = 4K, type = regular ] 0 4K [ bytenr X, offset = 4K, num_bytes = 508K, type = prealloc ] 4K 512K [ bytenr X, offset = 512K, num_bytes = 4K, type = regular ] 512K 516K [ bytenr X, offset = 516K, num_bytes = 508K, type = prealloc ] 516K 1M So we end up with 4 consecutive file extent items pointing to the data extent at bytenr X. 4) Hole punching in the middle of an extent. For example if a file has the following file extent item: [ bytenr X, offset = 0, num_bytes = 8M ] 0 8M And then hole is punched for the file range [4M, 6M[, we our file extent item split into two: [ bytenr X, offset = 0, num_bytes = 4M ] 0 4M [ 2M hole, implicit or explicit depending on NO_HOLES feature ] 4M 6M [ bytenr X, offset = 6M, num_bytes = 2M ] 6M 8M Again, we end up with two file extent items pointing to the same data extent. 5) When reflinking (clone and deduplication) within the same file. This is probably the least common case of all. In cases 1, 2, 4 and 4, when we have multiple file extent items that point to the same data extent, their distance is usually short, typically separated by a few slots in a b+tree leaf (or across sibling leaves). For case 5, the distance can vary a lot, but it's typically the less common case. This change caches the result of the sharedness checks for data extents, but only for the last 8 extents that we notice that our inode refers to with multiple file extent items. Whenever we want to check if a data extent is shared, we lookup the cache which consists of doing a linear scan of an 8 elements array, and if we find the data extent there, we return the result and don't check the extent tree and delayed refs. The array/cache is small so that doing the search has no noticeable negative impact on the performance in case we don't have file extent items within a distance of 8 slots that point to the same data extent. Slots in the cache/array are overwritten in a simple round robin fashion, as that approach fits very well. Using this simple approach with only the last 8 data extents seen is effective as usually when multiple file extents items point to the same data extent, their distance is within 8 slots. It also uses very little memory and the time to cache a result or lookup the cache is negligible. The following test was run on non-debug kernel (Debian's default kernel config) to measure the impact in the case of COW writes (first example given above), where we run fiemap after overwriting 33% of the blocks of a file: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi umount $DEV &> /dev/null mkfs.btrfs -f $DEV mount $DEV $MNT FILE_SIZE=$((1 * 1024 * 1024 * 1024)) # Create the file full of 1M extents. xfs_io -f -s -c "pwrite -b 1M -S 0xab 0 $FILE_SIZE" $MNT/foobar block_count=$((FILE_SIZE / 4096)) # Overwrite about 33% of the file blocks. overwrite_count=$((block_count / 3)) echo -e "\nOverwriting $overwrite_count 4K blocks (out of $block_count)..." RANDOM=123 for ((i = 1; i <= $overwrite_count; i++)); do off=$(((RANDOM % block_count) * 4096)) xfs_io -c "pwrite -S 0xcd $off 4K" $MNT/foobar > /dev/null echo -ne "\r$i blocks overwritten..." done echo -e "\n" # Unmount and mount to clear all cached metadata. umount $MNT mount $DEV $MNT start=$(date +%s%N) filefrag $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds" umount $MNT Result before applying this patch: fiemap took 128 milliseconds Result after applying this patch: fiemap took 92 milliseconds (-28.1%) The test is somewhat limited in the sense the gains may be higher in practice, because in the test the filesystem is small, so we have small fs and extent trees, plus there's no concurrent access to the trees as well, therefore no lock contention there. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:39 +01:00
Filipe Manana	56f5c19920	btrfs: remove useless logic when finding parent nodes At find_parent_nodes(), at its last step, when iterating over all direct references, we are checking if we have a share context and if we have a reference with a different root from the one in the share context. However that logic is pointless because of two reasons: 1) After the previous patch in the series (subject "btrfs: remove roots ulist when checking data extent sharedness"), the roots argument is always NULL when using a share check context (struct share_check), so this code is never triggered; 2) Even before that previous patch, we could not hit this code because if we had a reference with a root different from the one in our share context, then we would have exited earlier when doing either of the following: - Adding a second direct ref to the direct refs red black tree resulted in extent_is_shared() returning true when called from add_direct_ref() -> add_prelim_ref(), after processing delayed references or while processing references in the extent tree; - When adding a second reference to the indirect refs red black tree (same as above, extent_is_shared() returns true); - If we only have one indirect reference and no direct references, then when resolving it at resolve_indirect_refs() we immediately return that the target extent is shared, therefore never reaching that loop that iterates over all direct references at find_parent_nodes(); - If we have 1 indirect reference and 1 direct reference, then we also exit early because extent_is_shared() ends up returning true when called through add_prelim_ref() (by add_direct_ref() or add_indirect_ref()) or add_delayed_refs(). Same applies as when having a combination of direct, indirect and indirect with missing key references. This logic had been obsoleted since commit `3ec4d3238a` ("btrfs: allow backref search checks for shared extents"), which introduced the early exits in case an extent is shared. So just remove that logic, and assert at find_parent_nodes() that when we have a share context we don't have a roots ulist and that we haven't found the extent to be directly shared after processing delayed references and all references from the extent tree. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:39 +01:00
Filipe Manana	b629685803	btrfs: remove roots ulist when checking data extent sharedness Currently btrfs_is_data_extent_shared() is passing a ulist for the roots argument of find_parent_nodes(), however it does not use that ulist for anything and for this context that list always ends up with at most one element. Since find_parent_nodes() is able to deal with a NULL ulist for its roots argument, make btrfs_is_data_extent_shared() pass it NULL and avoid the burden of allocating memory for the unnused roots ulist, initializing it, releasing it and allocating one struct ulist_node for it during the call to find_parent_nodes(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:39 +01:00
Filipe Manana	84a7949d40	btrfs: move ulists to data extent sharedness check context When calling btrfs_is_data_extent_shared() we pass two ulists that were allocated by the caller. This is because the single caller, fiemap, calls btrfs_is_data_extent_shared() multiple times and the ulists can be reused, instead of allocating new ones before each call and freeing them after each call. Now that we have a context structure/object that we pass to btrfs_is_data_extent_shared(), we can move those ulists to it, and hide their allocation and the context's allocation in a helper function, as well as the freeing of the ulists and the context object. This allows to reduce the number of parameters passed to btrfs_is_data_extent_shared(), the need to pass the ulists from extent_fiemap() to fiemap_process_hole() and having the caller deal with allocating and releasing the ulists. Also rename one of the ulists from 'tmp' / 'tmp_ulist' to 'refs', since that's a much better name as it reflects what the list is used for (and matching the argument name for find_parent_nodes()). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:39 +01:00
Filipe Manana	61dbb952f0	btrfs: turn the backref sharedness check cache into a context object Right now we are using a struct btrfs_backref_shared_cache to pass state across multiple btrfs_is_data_extent_shared() calls. The structure's name closely follows its current purpose, which is to cache previous checks for the sharedness of metadata extents. However we will start using the structure for more things other than caching sharedness checks, so rename it to struct btrfs_backref_share_check_ctx. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:39 +01:00
Filipe Manana	ceb707da9a	btrfs: directly pass the inode to btrfs_is_data_extent_shared() Currently we pass a root and an inode number as arguments for btrfs_is_data_extent_shared() and the inode number is always from an inode that belongs to that root (it wouldn't make sense otherwise). In every context that we call btrfs_is_data_extent_shared() (fiemap only), we have an inode available, so directly pass the inode to the function instead of a root and inode number. This reduces the number of parameters and it makes the function's signature conform to most other functions we have. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:39 +01:00
Filipe Manana	a0a5472ad8	btrfs: remove checks for a 0 inode number during backref walking When doing backref walking to determine if an extent is shared, we are testing if the inode number, stored in the 'inum' field of struct share_check, is 0. However that can never be case, since the all instances of the structure are created at btrfs_is_data_extent_shared(), which always initializes it with the inode number from a fs tree (and the number for any inode from any tree can never be 0). So remove the checks. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:38 +01:00
Filipe Manana	c902421927	btrfs: remove checks for a root with id 0 during backref walking When doing backref walking to determine if an extent is shared, we are testing the root_objectid of the given share_check struct is 0, but that is an impossible case, since btrfs_is_data_extent_shared() always initializes the root_objectid field with the id of the given root, and no root can have an objectid of 0. So remove those checks. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:38 +01:00
Filipe Manana	206c1d32f3	btrfs: drop redundant bflags initialization when allocating extent buffer When allocating an extent buffer, at __alloc_extent_buffer(), there's no point in explicitly assigning zero to the bflags field of the new extent buffer because we allocated it with kmem_cache_zalloc(). So just remove the redundant initialization, it saves one mov instruction in the generated assembly code for x86_64 ("movq $0x0,0x10(%rax)"). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:38 +01:00
Filipe Manana	b98c6cd59e	btrfs: drop pointless memset when cloning extent buffer At btrfs_clone_extent_buffer(), before allocating the pages array for the new extent buffer we are calling memset() to zero out the pages array of the extent buffer. This is pointless however, because the extent buffer already has every element in its pages array pointing to NULL, as it was allocated with kmem_cache_zalloc(). The memset() was introduced with commit `dd137dd1f2` ("btrfs: factor out allocating an array of pages"), but even before that commit we already depended on the pages array being initialized to NULL for the error paths that need to call btrfs_release_extent_buffer(). So remove the memset(), it's useless and slightly increases the object text size. Before this change: $ size fs/btrfs/extent_io.o text data bss dec hex filename 70580 5469 40 76089 12939 fs/btrfs/extent_io.o After this change: $ size fs/btrfs/extent_io.o text data bss dec hex filename 70564 5469 40 76073 12929 fs/btrfs/extent_io.o Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:38 +01:00
Filipe Manana	a2853ffc2e	btrfs: skip unnecessary delalloc search during fiemap and lseek During fiemap and lseek (hole and data seeking), there's no point in iterating the inode's io tree to count delalloc bits if the inode's delalloc bytes counter has a value of zero, as that counter is updated whenever we set a range for delalloc or clear a range from delalloc. So skip the counting and io tree iteration if the inode's delalloc bytes counter has a value of zero. This helps save time when processing a file range corresponding to a hole or prealloc (unwritten) extent. This patch is part of a series comprised of the following patches: btrfs: get the next extent map during fiemap/lseek more efficiently btrfs: skip unnecessary extent map searches during fiemap and lseek btrfs: skip unnecessary delalloc search during fiemap and lseek The following test was performed on a release kernel (Debian's default kernel config) before and after applying those 3 patches. # Wrapper to call fiemap in extent count only mode. # (struct fiemap::fm_extent_count set to 0) $ cat fiemap.c #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <fcntl.h> #include <errno.h> #include <string.h> #include <sys/ioctl.h> #include <linux/fs.h> #include <linux/fiemap.h> int main(int argc, char *argv) { struct fiemap fiemap = { 0 }; int fd; if (argc != 2) { printf("usage: %s <path>\n", argv[0]); return 1; } fd = open(argv[1], O_RDONLY); if (fd < 0) { fprintf(stderr, "error opening file: %s\n", strerror(errno)); return 1; } / fiemap.fm_extent_count set to 0, to count extents only. / fiemap.fm_length = FIEMAP_MAX_OFFSET; if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) { fprintf(stderr, "fiemap error: %s\n", strerror(errno)); return 1; } close(fd); printf("fm_mapped_extents = %d\n", fiemap.fm_mapped_extents); return 0; } $ gcc -o fiemap fiemap.c And the wrapper shell script that creates a file with many holes and runs fiemap against it: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount $DEV $MNT FILE_SIZE=$((1 1024 * 1024 * 1024)) echo -n > $MNT/foobar for ((off = 0; off < $FILE_SIZE; off += 8192)); do xfs_io -c "pwrite -S 0xab $off 4K" $MNT/foobar > /dev/null done # flush all delalloc sync start=$(date +%s%N) ./fiemap $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds" umount $MNT Result before applying patchset: fm_mapped_extents = 131072 fiemap took 63 milliseconds Result after applying patchset: fm_mapped_extents = 131072 fiemap took 39 milliseconds (-38.1%) Running the same test for a 512M file instead of a 1G file, gave the following results. Result before applying patchset: fm_mapped_extents = 65536 fiemap took 29 milliseconds Result after applying patchset: fm_mapped_extents = 65536 fiemap took 20 milliseconds (-31.0%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:38 +01:00
Filipe Manana	013f9c70d2	btrfs: skip unnecessary extent map searches during fiemap and lseek If we have no outstanding extents it means we don't have any extent maps corresponding to delalloc that is flushing, as when an ordered extent is created we increment the number of outstanding extents to 1 and when we remove the ordered extent we decrement them by 1. So skip extent map tree searches if the number of outstanding ordered extents is 0, saving time as the tree is not empty if we have previously made some reads or flushed delalloc, as in those cases it can have a very large number of extent maps for files with many extents. This helps save time when processing a file range corresponding to a hole or prealloc (unwritten) extent. The next patch in the series has a performance test in its changelog and its subject is: "btrfs: skip unnecessary delalloc search during fiemap and lseek" Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:38 +01:00
Filipe Manana	d47704bd1c	btrfs: get the next extent map during fiemap/lseek more efficiently At find_delalloc_subrange(), when we need to get the next extent map, we do a full search on the extent map tree (a red black tree). This is fine but it's a lot more efficient to simply use rb_next(), which typically requires iterating over less nodes of the tree and never needs to compare the ranges of nodes with the one we are looking for. So add a public helper to extent_map.{h,c} to get the extent map that immediately follows another extent map, using rb_next(), and use that helper at find_delalloc_subrange(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:38 +01:00
Qu Wenruo	88074c8b13	btrfs: raid56: make it more explicit that cache rbio should have all its data sectors uptodate For Btrfs RAID56, we have a caching system for btrfs raid bios (rbio). We call cache_rbio_pages() to mark a qualified rbio ready for cache. The timing happens at: - finish_rmw() At this timing, we have already read all necessary sectors, along with the rbio sectors, we have covered all data stripes. - __raid_recover_end_io() At this timing, we have rebuild the rbio, thus all data sectors involved (either from stripe or bio list) are uptodate now. Thus at the timing of cache_rbio_pages(), we should have all data sectors uptodate. This patch will make it explicit that all data sectors are uptodate at cache_rbio_pages() timing, mostly to prepare for the incoming verification at RMW time. This patch will add: - Extra ASSERT()s in cache_rbio_pages() This is to make sure all data sectors, which are not covered by bio, are already uptodate. - Extra ASSERT()s in steal_rbio() Since only cached rbio can be stolen, thus every data sector should already be uptodate in the source rbio. - Update __raid_recover_end_io() to update recovered sector->uptodate Previously __raid_recover_end_io() will only mark failed sectors uptodate if it's doing an RMW. But this can trigger new ASSERT()s, as for recovery case, a recovered failed sector will not be marked uptodate, and trigger ASSERT() in later cache_rbio_pages() call. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:38 +01:00
Qu Wenruo	797d74b749	btrfs: raid56: allocate memory separately for rbio pointers Currently inside alloc_rbio(), we allocate a larger memory to contain the following members: - struct btrfs_raid_rbio itself - stripe_pages array - bio_sectors array - stripe_sectors array - finish_pointers array Then update rbio pointers to point the extra space after the rbio structure itself. Thus it introduced a complex CONSUME_ALLOC() macro to help the thing. This is too hacky, and is going to make later pointers expansion harder. This patch will change it to use regular kcalloc() for each pointer inside btrfs_raid_bio, making the later expansion much easier. And introduce a helper free_raid_bio_pointers() to free up all the pointer members in btrfs_raid_bio, which will be used in both free_raid_bio() and error path of alloc_rbio(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:38 +01:00
Qu Wenruo	ff2b64a22a	btrfs: raid56: cleanup for function __free_raid_bio() The cleanup involves two things: - Remove the "__" prefix There is no naming confliction. - Remove the forward declaration There is no special function call involved. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:38 +01:00
Josef Bacik	765c3fe99b	btrfs: introduce BTRFS_RESERVE_FLUSH_EMERGENCY Inside of FB, as well as some user reports, we've had a consistent problem of occasional ENOSPC transaction aborts. Inside FB we were seeing ~100-200 ENOSPC aborts per day in the fleet, which is a really low occurrence rate given the size of our fleet, but it's not nothing. There are two causes of this particular problem. First is delayed allocation. The reservation system for delalloc assumes that contiguous dirty ranges will result in 1 file extent item. However if there is memory pressure that results in fragmented writeout, or there is fragmentation in the block groups, this won't necessarily be true. Consider the case where we do a single 256MiB write to a file and then close it. We will have 1 reservation for the inode update, the reservations for the checksum updates, and 1 reservation for the file extent item. At some point later we decide to write this entire range out, but we're so fragmented that we break this into 100 different file extents. Since we've already closed the file and are no longer writing to it there's nothing to trigger a refill of the delalloc block rsv to satisfy the 99 new file extent reservations we need. At this point we exhaust our delalloc reservation, and we begin to steal from the global reserve. If you have enough of these cases going in parallel you can easily exhaust the global reserve, get an ENOSPC at btrfs_alloc_tree_block() time, and then abort the transaction. The other case is the delayed refs reserve. The delayed refs reserve updates its size based on outstanding delayed refs and dirty block groups. However we only refill this block reserve when returning excess reservations and when we call btrfs_start_transaction(root, X). We will reserve 2*X credits at transaction start time, and fill in X into the delayed refs reserve to make sure it stays topped off. Generally this works well, but clearly has downsides. If we do a particularly delayed ref heavy operation we may never catch up in our reservations. Additionally running delayed refs generates more delayed refs, and at that point we may be committing the transaction and have no way to trigger a refill of our delayed refs rsv. Then a similar thing occurs with the delalloc reserve. Generally speaking we well over-reserve in all of our block rsvs. If we reserve 1 credit we're usually reserving around 264k of space, but we'll often not use any of that reservation, or use a few blocks of that reservation. We can be reasonably sure that as long as you were able to reserve space up front for your operation you'll be able to find space on disk for that reservation. So introduce a new flushing state, BTRFS_RESERVE_FLUSH_EMERGENCY. This gets used in the case that we've exhausted our reserve and the global reserve. It simply forces a reservation if we have enough actual space on disk to make the reservation, which is almost always the case. This keeps us from hitting ENOSPC aborts in these odd occurrences where we've not kept up with the delayed work. Fixing this in a complete way is going to be relatively complicated and time consuming. This patch is what I discussed with Filipe earlier this year, and what I put into our kernels inside FB. With this patch we're down to 1-2 ENOSPC aborts per week, which is a significant reduction. This is a decent stop gap until we can work out a more wholistic solution to these two corner cases. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:38 +01:00
Josef Bacik	7a66eda351	btrfs: move the btrfs_verity_descriptor_item defs up in ctree.h These are wrapped in CONFIG_FS_VERITY, but we can have the definitions without verity enabled. Move these definitions up with the other accessor helpers. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:37 +01:00
Josef Bacik	890d2b1aa3	btrfs: move btrfs_next_old_item into ctree.c This uses btrfs_header_nritems, which I will be moving out of ctree.h. In order to avoid needing to include the relevant header in ctree.h, simply move this helper function into ctree.c. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename parameters ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:37 +01:00
Josef Bacik	eda517fd0c	btrfs: move free space cachep's out of ctree.h This is local to the free-space-cache.c code, remove it from ctree.h and inode.c, create new init/exit functions for the cachep, and move it locally to free-space-cache.c. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:37 +01:00

1 2 3 4 5 ...

11667 Commits