1749 Commits

Author SHA1 Message Date
David Sterba
150cce2d9f btrfs: move extent_buffer::lock_owner to debug section
The lock_owner is used for a rare corruption case and we haven't seen
any reports in years. Move it to the debugging section of eb.  To close
the holes also move log_index so the final layout looks like:

struct extent_buffer {
        u64                        start;                /*     0     8 */
        long unsigned int          len;                  /*     8     8 */
        long unsigned int          bflags;               /*    16     8 */
        struct btrfs_fs_info *     fs_info;              /*    24     8 */
        spinlock_t                 refs_lock;            /*    32     4 */
        atomic_t                   refs;                 /*    36     4 */
        int                        read_mirror;          /*    40     4 */
        s8                         log_index;            /*    44     1 */

        /* XXX 3 bytes hole, try to pack */

        struct callback_head       callback_head __attribute__((__aligned__(8))); /*    48    16 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        struct rw_semaphore        lock;                 /*    64    40 */
        struct page *              pages[16];            /*   104   128 */

        /* size: 232, cachelines: 4, members: 11 */
        /* sum members: 229, holes: 1, sum holes: 3 */
        /* forced alignments: 1, forced holes: 1, sum forced holes: 3 */
        /* last cacheline: 40 bytes */
} __attribute__((__aligned__(8)));

This saves 8 bytes in total and still keeps the lock on a separate cacheline.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:05 +02:00
David Sterba
007dec8c7e btrfs: reduce parameters of btrfs_pin_extent_for_log_replay
Both callers of btrfs_pin_extent_for_log_replay expand the parameters to
extent buffer members. We can simply pass the extent buffer instead.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:04 +02:00
David Sterba
f863c50277 btrfs: reduce parameters of btrfs_pin_reserved_extent
There is only one caller of btrfs_pin_reserved_extent that expands the
parameters to extent buffer members. We can simply pass the extent
buffer instead.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:04 +02:00
David Sterba
9580503bcb btrfs: reformat remaining kdoc style comments
Function name in the comment does not bring much value to code not
exposed as API and we don't stick to the kdoc format anymore. Update
formatting of parameter descriptions.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:04 +02:00
Josef Bacik
03e8634896 btrfs: remove btrfs_crc32c wrapper
This simply sends the same arguments into crc32c(), and is just used in
a few places.  Remove this wrapper and directly call crc32c() in these
instances.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:02 +02:00
Filipe Manana
8ec0a4a577 btrfs: log message if extent item not found when running delayed extent op
When running a delayed extent operation, if we don't find the extent item
in the extent tree we just return -EIO without any logged message. This
indicates some bug or possibly a memory or fs corruption, so the return
value should not be -EIO but -EUCLEAN instead, and since it's not expected
to ever happen, print an informative error message so that if it happens
we have some idea of what went wrong, where to look at.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-20 20:42:58 +02:00
Filipe Manana
d2f79e6385 btrfs: remove redundant BUG_ON() from __btrfs_inc_extent_ref()
At __btrfs_inc_extent_ref() we are doing a BUG_ON() if we are dealing with
a tree block reference that has a reference count that is different from 1,
but we have already dealt with this case at run_delayed_tree_ref(), making
it useless. So remove the BUG_ON().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-20 20:42:47 +02:00
Filipe Manana
1bf76df3fe btrfs: return -EUCLEAN for delayed tree ref with a ref count not equals to 1
When running a delayed tree reference, if we find a ref count different
from 1, we return -EIO. This isn't an IO error, as it indicates either a
bug in the delayed refs code or a memory corruption, so change the error
code from -EIO to -EUCLEAN. Also tag the branch as 'unlikely' as this is
not expected to ever happen, and change the error message to print the
tree block's bytenr without the parenthesis (and there was a missing space
between the 'block' word and the opening parenthesis), for consistency as
that's the style we used everywhere else.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-20 20:42:33 +02:00
Qu Wenruo
182741d287 btrfs: remove v0 extent handling
The v0 extent item has been deprecated for a long time, and we don't have
any report from the community either.

So it's time to remove the v0 extent specific error handling, and just
treat them as regular extent tree corruption.

This patch would remove the btrfs_print_v0_err() helper, and enhance the
involved error handling to treat them just as any extent tree
corruption. No reports regarding v0 extents have been seen since the
graceful handling was added in 2018.

This involves:

- btrfs_backref_add_tree_node()
  This change is a little tricky, the new code is changed to only handle
  BTRFS_TREE_BLOCK_REF_KEY and BTRFS_SHARED_BLOCK_REF_KEY.

  But this is safe, as we have rejected any unknown inline refs through
  btrfs_get_extent_inline_ref_type().
  For keyed backrefs, we're safe to skip anything we don't know (that's
  if it can pass tree-checker in the first place).

- btrfs_lookup_extent_info()
- lookup_inline_extent_backref()
- run_delayed_extent_op()
- __btrfs_free_extent()
- add_tree_block()
  Regular error handling of unexpected extent tree item, and abort
  transaction (if we have a trans handle).

- remove_extent_data_ref()
  It's pretty much the same as the regular rejection of unknown backref
  key.
  But for this particular case, we can also remove a BUG_ON().

- extent_data_ref_count()
  We can remove the BTRFS_EXTENT_REF_V0_KEY BUG_ON(), as it would be
  rejected by the only caller.

- btrfs_print_leaf()
  Remove the handling for BTRFS_EXTENT_REF_V0_KEY.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:54:48 +02:00
Qu Wenruo
7f72f50547 btrfs: output extra debug info if we failed to find an inline backref
[BUG]
Syzbot reported several warning triggered inside
lookup_inline_extent_backref().

[CAUSE]
As usual, the reproducer doesn't reliably trigger locally here, but at
least we know the WARN_ON() is triggered when an inline backref can not
be found, and it can only be triggered when @insert is true. (I.e.
inserting a new inline backref, which means the backref should already
exist)

[ENHANCEMENT]
After the WARN_ON(), dump all the parameters and the extent tree
leaf to help debug.

Link: https://syzkaller.appspot.com/bug?extid=d6f9ff86c1d804ba2bc6
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:54:48 +02:00
Naohiro Aota
332581bde2 btrfs: zoned: do not zone finish data relocation block group
When multiple writes happen at once, we may need to sacrifice a currently
active block group to be zone finished for a new allocation. We choose a
block group with the least free space left, and zone finish it.

To do the finishing, we need to send IOs for already allocated region
and wait for them and on-going IOs. Otherwise, these IOs fail because the
zone is already finished at the time the IO reach a device.

However, if a block group dedicated to the data relocation is zone
finished, there is a chance that finishing it before an ongoing write IO
reaches the device. That is because there is timing gap between an
allocation is done (block_group->reservations == 0, as pre-allocation is
done) and an ordered extent is created when the relocation IO starts.
Thus, if we finish the zone between them, we can fail the IOs.

We cannot simply use "fs_info->data_reloc_bg == block_group->start" to
avoid the zone finishing. Because, the data_reloc_bg may already switch to
a new block group, while there are still ongoing write IOs to the old
data_reloc_bg.

So, this patch reworks the BLOCK_GROUP_FLAG_ZONED_DATA_RELOC bit to
indicate there is a data relocation allocation and/or ongoing write to the
block group. The bit is set on allocation and cleared in end_io function of
the last IO for the currently allocated region.

To change the timing of the bit setting also solves the issue that the bit
being left even after there is no IO going on. With the current code, if
the data_reloc_bg switches after the last IO to the current data_reloc_bg,
the bit is set at this timing and there is no one clearing that bit. As a
result, that block group is kept unallocatable for anything.

Fixes: 343d8a30851c ("btrfs: zoned: prevent allocation from previous data relocation BG")
Fixes: 74e91b12b115 ("btrfs: zoned: zone finish unused block group")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:54:47 +02:00
Josef Bacik
cd361199ff btrfs: wait on uncached block groups on every allocation loop
My initial fix for the generic/475 hangs was related to metadata, but
our CI testing uncovered another case where we hang for similar reasons.
We again have a task with a plug that is holding an outstanding request
that is keeping the dm device from finishing it's suspend, and that task
is stuck in the allocator.

This time it is stuck trying to allocate data, but we do not have a
block group that matches the size class.  The larger loop in the
allocator looks like this (simplified of course)

  find_free_extent
    for_each_block_group {
      ffe_ctl->cached == btrfs_block_group_cache_done(bg)
      if (!ffe_ctl->cached)
	ffe_ctl->have_caching_bg = true;
      do_allocation()
	btrfs_wait_block_group_cache_progress();
    }

    if (loop == LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg)
      go search again;

In my earlier fix we were trying to allocate from the block group, but
we weren't waiting for the progress because we were only waiting for the
free space to be >= the amount of free space we wanted.  My fix made it
so we waited for forward progress to be made as well, so we would be
sure to wait.

This time however we did not have a block group that matched our size
class, so what was happening was this

  find_free_extent
    for_each_block_group {
      ffe_ctl->cached == btrfs_block_group_cache_done(bg)
      if (!ffe_ctl->cached)
	ffe_ctl->have_caching_bg = true;
      if (size_class_doesn't_match())
	goto loop;
      do_allocation()
	btrfs_wait_block_group_cache_progress();
  loop:
      release_block_group(block_group);
    }

    if (loop == LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg)
      go search again;

The size_class_doesn't_match() part was true, so we'd just skip this
block group and never wait for caching, and then because we found a
caching block group we'd just go back and do the loop again.  We never
sleep and thus never flush the plug and we have the same deadlock.

Fix the logic for waiting on the block group caching to instead do it
unconditionally when we goto loop.  This takes the logic out of the
allocation step, so now the loop looks more like this

  find_free_extent
    for_each_block_group {
      ffe_ctl->cached == btrfs_block_group_cache_done(bg)
      if (!ffe_ctl->cached)
	ffe_ctl->have_caching_bg = true;
      if (size_class_doesn't_match())
	goto loop;
      do_allocation()
	btrfs_wait_block_group_cache_progress();
  loop:
      if (loop > LOOP_CACHING_NOWAIT && !ffe_ctl->retry_uncached &&
	  !ffe_ctl->cached) {
	 ffe_ctl->retry_uncached = true;
	 btrfs_wait_block_group_cache_progress();
      }

      release_block_group(block_group);
    }

    if (loop == LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg)
      go search again;

This simplifies the logic a lot, and makes sure that if we're hitting
uncached block groups we're always waiting on them at some point.

I ran this through 100 iterations of generic/475, as this particular
case was harder to hit than the previous one.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:54:47 +02:00
Qu Wenruo
257614301a btrfs: handle errors properly in update_inline_extent_backref()
[PROBLEM]
Inside function update_inline_extent_backref(), we have several
BUG_ON()s along with some ASSERT()s which can be triggered by corrupted
filesystem.

[ANAYLYSE]
Most of those BUG_ON()s and ASSERT()s are just a way of handling
unexpected on-disk data.

Although we have tree-checker to rule out obviously incorrect extent
tree blocks, it's not enough for these ones.  Thus we need proper error
handling for them.

[FIX]
Thankfully all the callers of update_inline_extent_backref() would
eventually handle the errror by aborting the current transaction.
So this patch would do the proper error handling by:

- Make update_inline_extent_backref() to return int
  The return value would be either 0 or -EUCLEAN.

- Replace BUG_ON()s and ASSERT()s with proper error handling
  This includes:
  * Dump the bad extent tree leaf
  * Output an error message for the cause
    This would include the extent bytenr, num_bytes (if needed), the bad
    values and expected good values.
  * Return -EUCLEAN

  Note here we remove all the WARN_ON()s, as eventually the transaction
  would be aborted, thus a backtrace would be triggered anyway.

- Better comments on why we expect refs == 1 and refs_to_mode == -1 for
  tree blocks

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:20 +02:00
Naohiro Aota
5a7d107e5e btrfs: zoned: don't activate non-DATA BG on allocation
Now that a non-DATA block group is activated at write time, don't
activate it on allocation time.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:19 +02:00
Josef Bacik
b9d97cff25 btrfs: move comments to btrfs_loop_type definition
Some of these loop types aren't described, and they should be with the
definitions to make it easier to tell what each of them do.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:16 +02:00
Filipe Manana
98b5a8fd2a btrfs: move btrfs_free_excluded_extents() into block-group.c
The function btrfs_free_excluded_extents() is only used by block-group.c,
so move it into block-group.c and make it static. Also removed unnecessary
variables that are used only once.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:13 +02:00
Filipe Manana
b1c8f527fe btrfs: open code trivial btrfs_add_excluded_extent()
The code for btrfs_add_excluded_extent() is trivial, it's just a
set_extent_bit() call. However it's defined in extent-tree.c but it is
only used (twice) in block-group.c. So open code it in block-group.c,
reducing the need to export a trivial function.

Also since the only caller btrfs_add_excluded_extent() is prepared to
deal with errors, stop ignoring errors from the set_extent_bit() call.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:13 +02:00
Filipe Manana
e5860f8207 btrfs: make find_first_extent_bit() return a boolean
Currently find_first_extent_bit() returns a 0 if it found a range in the
given io tree and 1 if it didn't find any. There's no need to return any
errors, so make the return value a boolean and invert the logic to make
more sense: return true if it found a range and false if it didn't find
any range.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:12 +02:00
Josef Bacik
92fb94b69c btrfs: set cache_block_group_error if we find an error
We set cache_block_group_error if btrfs_cache_block_group() returns an
error, this is because we could end up not finding space to allocate and
mistakenly return -ENOSPC, and which could then abort the transaction
with the incorrect errno, and in the case of ENOSPC result in a
WARN_ON() that will trip up tests like generic/475.

However there's the case where multiple threads can be racing, one
thread gets the proper error, and the other thread doesn't actually call
btrfs_cache_block_group(), it instead sees ->cached ==
BTRFS_CACHE_ERROR.  Again the result is the same, we fail to allocate
our space and return -ENOSPC.  Instead we need to set
cache_block_group_error to -EIO in this case to make sure that if we do
not make our allocation we get the appropriate error returned back to
the caller.

CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-10 17:16:45 +02:00
Filipe Manana
61c681fef7 btrfs: use bool type for delayed ref head fields that are used as booleans
There's no point in have several fields defined as 1 bit unsigned int in
struct btrfs_delayed_ref_head, we can instead use a bool type, it makes
the code a bit more readable and it doesn't change the structure size.
So switch them to proper booleans.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:32 +02:00
Filipe Manana
4d34ad34d7 btrfs: remove pointless in_tree field from struct btrfs_delayed_ref_node
The 'in_tree' field is really not needed in struct btrfs_delayed_ref_node,
as we can check whether a reference is in the tree or not simply by
checking its red black tree node member with RB_EMPTY_NODE(), as when we
remove it from the tree we always call RB_CLEAR_NODE(). So remove that
field and use RB_EMPTY_NODE().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:31 +02:00
David Sterba
1d12680044 btrfs: drop gfp from parameter extent state helpers
Now that all extent state bit helpers effectively take the GFP_NOFS mask
(and GFP_NOWAIT is encoded in the bits) we can remove the parameter.
This reduces stack consumption in many functions and simplifies a lot of
code.

Net effect on module on a release build:

   text    data     bss     dec     hex filename
1250432   20985   16088 1287505  13a551 pre/btrfs.ko
1247074   20985   16088 1284147  139833 post/btrfs.ko

DELTA: -3358

Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
David Sterba
7dde7a8ab3 btrfs: drop NOFAIL from set_extent_bit allocation masks
The __GFP_NOFAIL passed to set_extent_bit first appeared in 2010
(commit f0486c68e4bd9a ("Btrfs: Introduce contexts for metadata
reservation")), without any explanation why it would be needed.

Meanwhile we've updated the semantics of set_extent_bit to handle failed
allocations and do unlock, sleep and retry if needed.  The use of the
NOFAIL flag is also an outlier, we never want any of the set/clear
extent bit helpers to fail, they're used for many critical changes like
extent locking, besides the extent state bit changes.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
David Sterba
0acd32c294 btrfs: open code set_extent_bits
This helper calls set_extent_bit with two more parameters set to default
values, but otherwise it's purpose is not clear.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
David Sterba
fe1a598c42 btrfs: open code set_extent_dirty
The helper is used a few times, that it's setting the DIRTY extent bit
is still clear.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
David Sterba
eea8686e68 btrfs: open code set_extent_new
The helper is used only once.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
Filipe Manana
618d1d7da5 btrfs: fix comment referring to no longer existing btrfs_clean_tree_block()
There's a comment at btrfs_init_new_buffer() that refers to a function
named btrfs_clean_tree_block(), however the function was renamed to
btrfs_clear_buffer_dirty() in commit 190a83391bc4 ("btrfs: rename
btrfs_clean_tree_block to btrfs_clear_buffer_dirty"). So update the
comment to refer to the current name.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:26 +02:00
Josef Bacik
4aec05fa5a btrfs: remove level argument from btrfs_set_block_flags
We just pass in btrfs_header_level(eb) for the level, and we're passing
in the eb already, so simply get the level from the eb inside of
btrfs_set_block_flags.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Anand Jain
adbe7e388e btrfs: use SECTOR_SHIFT to convert LBA to physical offset
Using SECTOR_SHIFT to convert LBA to physical address makes it more
readable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Anand Jain
29e70be261 btrfs: use SECTOR_SHIFT to convert physical offset to LBA
Use SECTOR_SHIFT while converting a physical address to an LBA, makes
it more readable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Qu Wenruo
eee3b81178 btrfs: improve leaf dump and error handling
Improve the leaf dump behavior by:

- Always dump the leaf first, then the error message

- Output the slot number if possible
  Especially in __btrfs_free_extent() the leaf dump of extent tree can
  be pretty large.
  With an extra slot number it's much easier to locate the problem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Qu Wenruo
6c75a589cb btrfs: print-tree: pass const extent buffer pointer
Since print-tree infrastructure only prints the content of a tree block,
we can make them to accept const extent buffer pointer.

This removes a forced type convert in extent-tree, where we convert a
const extent buffer pointer to regular one, just to avoid compiler
warning.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Filipe Manana
a8fdc05172 btrfs: remove obsolete delayed ref throttling logic when truncating items
We have this logic encapsulated in btrfs_should_throttle_delayed_refs()
where we try to estimate if running the current amount of delayed
references we have will take more than half a second, and if so, the
caller btrfs_should_throttle_delayed_refs() should do something to
prevent more and more delayed refs from being accumulated.

This logic was added in commit 0a2b2a844af6 ("Btrfs: throttle delayed
refs better") and then further refined in commit a79b7d4b3e81 ("Btrfs:
async delayed refs"). The idea back then was that the caller of
btrfs_should_throttle_delayed_refs() would release its transaction
handle (by calling btrfs_end_transaction()) when that function returned
true, then btrfs_end_transaction() would trigger an async job to run
delayed references in a workqueue, and later start/join a transaction
again and do more work.

However we don't run delayed references asynchronously anymore, that
was removed in commit db2462a6ad3d ("btrfs: don't run delayed refs in
the end transaction logic"). That makes the logic that tries to estimate
how long we will take to run our current delayed references, at
btrfs_should_throttle_delayed_refs(), pointless as we don't take any
action to run delayed references anymore. We do have other type of
throttling, which consists of checking the size and reserved space of
the delayed and global block reserves, as well as if fluhsing delayed
references for the current transaction was already started, etc - this
is all done by btrfs_should_end_transaction(), and the only user of
btrfs_should_throttle_delayed_refs() does periodically call
btrfs_should_end_transaction().

So remove btrfs_should_throttle_delayed_refs() and the infrastructure
that keeps track of the average time used for running delayed references,
as well as adapting btrfs_truncate_inode_items() to call
btrfs_check_space_for_delayed_refs() instead.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:19 +02:00
Josef Bacik
9a93b5a353 btrfs: abort the transaction if we get an error during snapshot drop
We were seeing weird errors when we were testing our btrfs backports
before we had the incorrect level check fix.  These errors appeared to
be improper error handling, but error injection testing uncovered that
the errors were a result of corruption that occurred from improper error
handling during snapshot delete.

With snapshot delete if we encounter any errors during walk_down or
walk_up we'll simply return an error, we won't abort the transaction.
This is problematic because we will be dropping references for nodes and
leaves along the way, and if we fail in the middle we will leave the
file system corrupt because we don't know where we left off in the drop.

Fix this by making sure we abort if we hit any errors during the walk
down or walk up operations, as we have no idea what operations could
have been left half done at this point.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:13 +02:00
Josef Bacik
4e19438400 btrfs: handle errors in walk_down_tree properly
We can get errors in walk_down_proc as we try and lookup extent info for
the snapshot dropping to act on.  However if we get an error we simply
return 1 which indicates we're done with walking down, which will lead
us to improperly continue with the snapshot drop with the incorrect
information.  Instead break if we get any error from walk_down_proc or
do_walk_down, and handle the case of ret == 1 by returning 0, otherwise
return the ret value that we have.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:13 +02:00
Josef Bacik
190a83391b btrfs: rename btrfs_clean_tree_block to btrfs_clear_buffer_dirty
btrfs_clean_tree_block is a misnomer, it's just
clear_extent_buffer_dirty with some extra accounting around it.  Rename
this to btrfs_clear_buffer_dirty to make it more clear it belongs with
it's setter, btrfs_mark_buffer_dirty.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:53 +01:00
Josef Bacik
ed25dab3a0 btrfs: add trans argument to btrfs_clean_tree_block
We check the header generation in the extent buffer against the current
running transaction id to see if it's safe to clear DIRTY on this
buffer.  Generally speaking if we're clearing the buffer dirty we're
holding the transaction open, but in the case of cleaning up an aborted
transaction we don't, so we have extra checks in that path to check the
transid.  To allow for a future cleanup go ahead and pass in the trans
handle so we don't have to rely on ->running_transaction being set.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:53 +01:00
Josef Bacik
d3fb66150c btrfs: always lock the block before calling btrfs_clean_tree_block
We want to clean up the dirty handling for extent buffers so it's a
little more consistent, so skip the check for generation == transid and
simply always lock the extent buffer before calling btrfs_clean_tree_block.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:53 +01:00
ye xingchen
58e36c2a01 btrfs: remove duplicate include header in extent-tree.c
extent-tree.h is included more than once, added in a0231804affe ("btrfs:
move extent-tree helpers into their own header file").

Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 17:50:34 +01:00
Boris Burkov
cb0922f264 btrfs: don't use size classes for zoned file systems
When a file system has ZNS devices which are constrained by a maximum
number of active block groups, then not being able to use all the block
groups for every allocation is not ideal, and could cause us to loop a
ton with mixed size allocations.

In general, since zoned doesn't write into gaps behind where block
groups are writing, it is not susceptible to the same sort of
fragmentation that size classes are designed to solve, so we can skip
size classes for zoned file systems in general, even though there would
probably be no harm for SMR devices.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 17:50:34 +01:00
Boris Burkov
52bb7a2166 btrfs: introduce size class to block group allocator
The aim of this patch is to reduce the fragmentation of block groups
under certain unhappy workloads. It is particularly effective when the
size of extents correlates with their lifetime, which is something we
have observed causing fragmentation in the fleet at Meta.

This patch categorizes extents into size classes:

- x < 128KiB: "small"
- 128KiB < x < 8MiB: "medium"
- x > 8MiB: "large"

and as much as possible reduces allocations of extents into block groups
that don't match the size class. This takes advantage of any (possible)
correlation between size and lifetime and also leaves behind predictable
re-usable gaps when extents are freed; small writes don't gum up bigger
holes.

Size classes are implemented in the following way:

- Mark each new block group with a size class of the first allocation
  that goes into it.

- Add two new passes to ffe: "unset size class" and "wrong size class".
  First, try only matching block groups, then try unset ones, then allow
  allocation of new ones, and finally allow mismatched block groups.

- Filtering is done just by skipping inappropriate ones, there is no
  special size class indexing.

Other solutions I considered were:

- A best fit allocator with an rb-tree. This worked well, as small
  writes didn't leak big holes from large freed extents, but led to
  regressions in ffe and write performance due to lock contention on
  the rb-tree with every allocation possibly updating it in parallel.
  Perhaps something clever could be done to do the updates in the
  background while being "right enough".

- A fixed size "working set". This prevents freeing an extent
  drastically changing where writes currently land, and seems like a
  good option too. Doesn't take advantage of size in any way.

- The same size class idea, but implemented with xarray marks. This
  turned out to be slower than looping the linked list and skipping
  wrong block groups, and is also less flexible since we must have only
  3 size classes (max #marks). With the current approach we can have as
  many as we like.

Performance testing was done via: https://github.com/josefbacik/fsperf
Of particular relevance are the new fragmentation specific tests.

A brief summary of the testing results:

- Neutral results on existing tests. There are some minor regressions
  and improvements here and there, but nothing that truly stands out as
  notable.
- Improvement on new tests where size class and extent lifetime are
  correlated. Fragmentation in these cases is completely eliminated
  and write performance is generally a little better. There is also
  significant improvement where extent sizes are just a bit larger than
  the size class boundaries.
- Regression on one new tests: where the allocations are sized
  intentionally a hair under the borders of the size classes. Results
  are neutral on the test that intentionally attacks this new scheme by
  mixing extent size and lifetime.

The full dump of the performance results can be found here:
https://bur.io/fsperf/size-class-2022-11-15.txt
(there are ANSI escape codes, so best to curl and view in terminal)

Here is a snippet from the full results for a new test which mixes
buffered writes appending to a long lived set of files and large short
lived fallocates:

bufferedappendvsfallocate results
         metric             baseline       current        stdev            diff
======================================================================================
avg_commit_ms                    31.13         29.20          2.67     -6.22%
bg_count                            14         15.60             0     11.43%
commits                          11.10         12.20          0.32      9.91%
elapsed                          27.30         26.40          2.98     -3.30%
end_state_mount_ns         11122551.90   10635118.90     851143.04     -4.38%
end_state_umount_ns           1.36e+09      1.35e+09   12248056.65     -1.07%
find_free_extent_calls       116244.30     114354.30        964.56     -1.63%
find_free_extent_ns_max      599507.20    1047168.20     103337.08     74.67%
find_free_extent_ns_mean       3607.19       3672.11        101.20      1.80%
find_free_extent_ns_min            500           512          6.67      2.40%
find_free_extent_ns_p50           2848          2876         37.65      0.98%
find_free_extent_ns_p95           4916          5000         75.45      1.71%
find_free_extent_ns_p99       20734.49      20920.48       1670.93      0.90%
frag_pct_max                     61.67             0          8.05   -100.00%
frag_pct_mean                    43.59             0          6.10   -100.00%
frag_pct_min                     25.91             0         16.60   -100.00%
frag_pct_p50                     42.53             0          7.25   -100.00%
frag_pct_p95                     61.67             0          8.05   -100.00%
frag_pct_p99                     61.67             0          8.05   -100.00%
fragmented_bg_count               6.10             0          1.45   -100.00%
max_commit_ms                    49.80            46          5.37     -7.63%
sys_cpu                           2.59          2.62          0.29      1.39%
write_bw_bytes                1.62e+08      1.68e+08   17975843.50      3.23%
write_clat_ns_mean            57426.39      54475.95       2292.72     -5.14%
write_clat_ns_p50             46950.40      42905.60       2101.35     -8.62%
write_clat_ns_p99            148070.40     143769.60       2115.17     -2.90%
write_io_kbytes                4194304       4194304             0      0.00%
write_iops                     2476.15       2556.10        274.29      3.23%
write_lat_ns_max            2101667.60    2251129.50     370556.59      7.11%
write_lat_ns_mean             59374.91      55682.00       2523.09     -6.22%
write_lat_ns_min              17353.10         16250       1646.08     -6.36%

There are some mixed improvements/regressions in most metrics along with
an elimination of fragmentation in this workload.

On the balance, the drastic 1->0 improvement in the happy cases seems
worth the mix of regressions and improvements we do observe.

Some considerations for future work:

- Experimenting with more size classes
- More hinting/search ordering work to approximate a best-fit allocator

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 17:50:34 +01:00
Boris Burkov
854c2f365d btrfs: add more find_free_extent tracepoints
find_free_extent is a complicated function. It consists (at least) of:

- a hint that jumps into the middle of a for loop macro
- a middle loop trying every raid level
- an outer loop ascending through ffe loop levels
- complicated logic for skipping some of those ffe loop levels
- multiple underlying in-bg allocators (zoned, cluster, no cluster)

Which is all to say that more tracing is helpful for debugging its
behavior. Add two new tracepoints: at the entrance to the block_groups
loop (hit for every raid level and every ffe_ctl loop) and at the point
we seriously consider a block_group for allocation. This way we can see
the whole path through the algorithm, including hints, multiple loops,
etc.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 17:50:34 +01:00
Boris Burkov
cfc2de0fce btrfs: pass find_free_extent_ctl to allocator tracepoints
The allocator tracepoints currently have a pile of values from ffe_ctl.
In modifying the allocator and adding more tracepoints, I found myself
adding to the already long argument list of the tracepoints. It makes it
a lot simpler to just send in the ffe_ctl itself.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 17:50:34 +01:00
Johannes Thumshirn
0c555c97ef btrfs: directly pass in fs_info to btrfs_merge_delayed_refs
Now that none of the functions called by btrfs_merge_delayed_refs() needs
a btrfs_trans_handle, directly pass in a btrfs_fs_info to
btrfs_merge_delayed_refs().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 17:50:33 +01:00
Qu Wenruo
39f501d68e btrfs: always report error in run_one_delayed_ref()
Currently we have a btrfs_debug() for run_one_delayed_ref() failure, but
if end users hit such problem, there will be no chance that
btrfs_debug() is enabled.  This can lead to very little useful info for
debugging.

This patch will:

- Add extra info for error reporting
  Including:
  * logical bytenr
  * num_bytes
  * type
  * action
  * ref_mod

- Replace the btrfs_debug() with btrfs_err()

- Move the error reporting into run_one_delayed_ref()
  This is to avoid use-after-free, the @node can be freed in the caller.

This error should only be triggered at most once.

As if run_one_delayed_ref() failed, we trigger the error message, then
causing the call chain to error out:

btrfs_run_delayed_refs()
`- btrfs_run_delayed_refs()
   `- btrfs_run_delayed_refs_for_head()
      `- run_one_delayed_ref()

And we will abort the current transaction in btrfs_run_delayed_refs().
If we have to run delayed refs for the abort transaction,
run_one_delayed_ref() will just cleanup the refs and do nothing, thus no
new error messages would be output.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-03 16:22:10 +01:00
Christoph Hellwig
103c19723c btrfs: split the bio submission path into a separate file
The code used by btrfs_submit_bio only interacts with the rest of
volumes.c through __btrfs_map_block (which itself is a more generic
version of two exported helpers) and does not really have anything
to do with volumes.c.  Create a new bio.c file and a bio.h header
going along with it for the btrfs_bio-based storage layer, which
will grow even more going forward.

Also update the file with my copyright notice given that a large
part of the moved code was written or rewritten by me.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:57 +01:00
Qu Wenruo
cb3e217bdb btrfs: use btrfs_dev_name() helper to handle missing devices better
[BUG]
If dev-replace failed to re-construct its data/metadata, the kernel
message would be incorrect for the missing device:

 BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
 BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev (efault)

Note the above "dev (efault)" of the second line.
While the first line is properly reporting "<missing disk>".

[CAUSE]
Although dev-replace is using btrfs_dev_name(), the heavy lifting work
is still done by scrub (scrub is reused by both dev-replace and regular
scrub).

Unfortunately scrub code never uses btrfs_dev_name() helper, as it's
only declared locally inside dev-replace.c.

[FIX]
Fix the output by:

- Move the btrfs_dev_name() helper to volumes.h

- Use btrfs_dev_name() to replace open-coded rcu_str_deref() calls
  Only zoned code is not touched, as I'm not familiar with degraded
  zoned code.

- Constify return value and parameter

Now the output looks pretty sane:

 BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
 BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev <missing disk>

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:57 +01:00
Qu Wenruo
789d6a3a87 btrfs: concentrate all tree block parentness check parameters into one structure
There are several different tree block parentness check parameters used
across several helpers:

- level
  Mandatory

- transid
  Under most cases it's mandatory, but there are several backref cases
  which skips this check.

- owner_root
- first_key
  Utilized by most top-down tree search routine. Otherwise can be
  skipped.

Those four members are not always mandatory checks, and some of them are
the same u64, which means if some arguments got swapped compiler will
not catch it.

Furthermore if we're going to further expand the parentness check, we
need to modify quite some helpers just to add one more parameter.

This patch will concentrate all these members into a structure called
btrfs_tree_parent_check, and pass that structure for the following
helpers:

- btrfs_read_extent_buffer()
- read_tree_block()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:56 +01:00
Josef Bacik
aa5d3003dd btrfs: move orphan prototypes into orphan.h
Move these out of ctree.h into orphan.h to cut down on code in ctree.h.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:47 +01:00
Josef Bacik
7c8ede1628 btrfs: move file-item prototypes into their own header
Move these prototypes out of ctree.h and into file-item.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:46 +01:00