// SPDX-License-Identifier: GPL-2.0

#include "misc.h"
#include "ctree.h"
#include "block-rsv.h"
#include "space-info.h"
#include "transaction.h"
#include "block-group.h"
#include "disk-io.h"
#include "fs.h"
#include "accessors.h"

/*
 * HOW DO BLOCK RESERVES WORK
 *
 *   Think of block_rsv's as buckets for logically grouped metadata
 *   reservations.  Each block_rsv has a ->size and a ->reserved.  ->size is
 *   how large we want our block rsv to be, ->reserved is how much space is
 *   currently reserved for this block reserve.
 *
 *   ->failfast exists for the truncate case, and is described below.
 *
 * NORMAL OPERATION
 *
 *   -> Reserve
 *     Entrance: btrfs_block_rsv_add, btrfs_block_rsv_refill
 *
 *     We call into btrfs_reserve_metadata_bytes() with our bytes, which is
 *     accounted for in space_info->bytes_may_use, and then add the bytes to
 *     ->reserved, and ->size in the case of btrfs_block_rsv_add.
 *
 *     ->size is an over-estimation of how much we may use for a particular
 *     operation.
 *
 *   -> Use
 *     Entrance: btrfs_use_block_rsv
 *
 *     When we do a btrfs_alloc_tree_block() we call into btrfs_use_block_rsv()
 *     to determine the appropriate block_rsv to use, and then verify that
 *     ->reserved has enough space for our tree block allocation.  Once
 *     successful we subtract fs_info->nodesize from ->reserved.
 *
 *   -> Finish
 *     Entrance: btrfs_block_rsv_release
 *
 *     We are finished with our operation, so we subtract our individual
 *     reservation from ->size, and then free up whatever ->reserved holds in
 *     excess of the new ->size, if there is any.
 *
 *     There is some logic here to refill the delayed refs rsv or the global rsv
 *     as needed, otherwise the excess is subtracted from
 *     space_info->bytes_may_use.
 *
 * TYPES OF BLOCK RESERVES
 *
 * BLOCK_RSV_TRANS, BLOCK_RSV_DELOPS, BLOCK_RSV_CHUNK
 *   These behave normally, as described above, just within the confines of the
 *   lifetime of their particular operation (transaction for the whole trans
 *   handle lifetime, for example).
 *
 * BLOCK_RSV_GLOBAL
 *   It is impossible to properly account for all the space that may be required
 *   to make our extent tree updates.  This block reserve acts as an overflow
 *   buffer in case our delayed refs reserve does not reserve enough space to
 *   update the extent tree.
 *
 *   We can steal from this in some cases as well, notably on evict() or
 *   truncate() in order to help users recover from ENOSPC conditions.
 *
 * BLOCK_RSV_DELALLOC
 *   The individual item sizes are determined by the per-inode size
 *   calculations, which are described with the delalloc code.  This is pretty
 *   straightforward; it's just that the calculation of ->size encodes a lot of
 *   different items, and thus it gets used when updating inodes, inserting file
 *   extents, and inserting checksums.
 *
 * BLOCK_RSV_DELREFS
 *   We keep a running tally of how many delayed refs we have on the system.
 *   We assume each one of these delayed refs is going to use a full
 *   reservation.  We use the transaction items and pre-reserve space for every
 *   operation, and use this reservation to refill any gap between ->size and
 *   ->reserved that may exist.
 *
 *   From there it's straightforward: removing a delayed ref means we remove its
 *   count from ->size and free up reservations as necessary.  Since this is
 *   the most dynamic block reserve in the system, we will try to refill this
 *   block reserve first with any excess returned by any other block reserve.
 *
 * BLOCK_RSV_EMPTY
 *   This is the fallback block reserve to make us try to reserve space if we
 *   don't have a specific bucket for this allocation.  It is mostly used for
 *   updating the device tree and such; since that is a separate pool, we're
 *   content to just reserve space from the space_info on demand.
 *
 * BLOCK_RSV_TEMP
 *   This is used by things like truncate and iput.  We will temporarily
 *   allocate a block reserve, set it to some size, and then truncate bytes
 *   until we have no space left.  With ->failfast set we'll simply return
 *   ENOSPC from btrfs_use_block_rsv() to signal that we need to unwind and try
 *   to make a new reservation.  This is because these operations are
 *   unbounded, so we want to do as much work as we can, and then back off and
 *   re-reserve.
 */
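
/*
 * ILLUSTRATION (not part of the kernel source)
 *
 * The Reserve / Use / Finish cycle described above can be sketched as a small
 * user-space model.  This is a simplified sketch under stated assumptions:
 * there is no locking, no space_info accounting and no qgroup handling, and
 * every toy_* name is hypothetical rather than a kernel symbol.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
struct toy_rsv {
	uint64_t size;		/* how large we want the reserve to be */
	uint64_t reserved;	/* how much is currently reserved */
};

/* Reserve: mirrors the btrfs_block_rsv_add() case, which grows both counters. */
static void toy_rsv_add(struct toy_rsv *rsv, uint64_t bytes)
{
	rsv->reserved += bytes;
	rsv->size += bytes;
}

/* Use: take one tree block's worth of bytes, or fail if we are short. */
static bool toy_rsv_use(struct toy_rsv *rsv, uint64_t nodesize)
{
	if (rsv->reserved < nodesize)
		return false;
	rsv->reserved -= nodesize;
	return true;
}

/* Finish: shrink ->size, then return whatever ->reserved exceeds it by. */
static uint64_t toy_rsv_release(struct toy_rsv *rsv, uint64_t bytes)
{
	uint64_t excess = 0;

	rsv->size -= bytes;
	if (rsv->reserved >= rsv->size) {
		excess = rsv->reserved - rsv->size;
		rsv->reserved = rsv->size;
	}
	return excess;
}
```

/*
 * In this model the value returned by toy_rsv_release() stands for the bytes
 * that the real code either hands to another reserve or subtracts from
 * space_info->bytes_may_use.
 */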

static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
				    struct btrfs_block_rsv *block_rsv,
				    struct btrfs_block_rsv *dest, u64 num_bytes,
				    u64 *qgroup_to_release_ret)
{
	struct btrfs_space_info *space_info = block_rsv->space_info;
	u64 qgroup_to_release = 0;
	u64 ret;

	spin_lock(&block_rsv->lock);
	if (num_bytes == (u64)-1) {
		num_bytes = block_rsv->size;
		qgroup_to_release = block_rsv->qgroup_rsv_size;
	}
	block_rsv->size -= num_bytes;
	if (block_rsv->reserved >= block_rsv->size) {
		num_bytes = block_rsv->reserved - block_rsv->size;
		block_rsv->reserved = block_rsv->size;
		block_rsv->full = true;
	} else {
		num_bytes = 0;
	}
	/*
	 * Only adjust the qgroup reservation when the caller asked for the
	 * released amount.  Otherwise we would free a reservation that the
	 * caller still expects to convert later, leaking the reserved space.
	 */
	if (qgroup_to_release_ret &&
	    block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) {
		qgroup_to_release = block_rsv->qgroup_rsv_reserved -
				    block_rsv->qgroup_rsv_size;
		block_rsv->qgroup_rsv_reserved = block_rsv->qgroup_rsv_size;
	} else {
		qgroup_to_release = 0;
	}
	spin_unlock(&block_rsv->lock);

	ret = num_bytes;
	if (num_bytes > 0) {
		if (dest) {
			spin_lock(&dest->lock);
			if (!dest->full) {
				u64 bytes_to_add;

				bytes_to_add = dest->size - dest->reserved;
				bytes_to_add = min(num_bytes, bytes_to_add);
				dest->reserved += bytes_to_add;
				if (dest->reserved >= dest->size)
					dest->full = true;
				num_bytes -= bytes_to_add;
			}
			spin_unlock(&dest->lock);
		}
		if (num_bytes)
			btrfs_space_info_free_bytes_may_use(fs_info,
							      space_info,
							      num_bytes);
	}
	if (qgroup_to_release_ret)
		*qgroup_to_release_ret = qgroup_to_release;
	return ret;
}
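
/*
 * ILLUSTRATION (not part of the kernel source)
 *
 * The second half of block_rsv_release_bytes() tops up the destination
 * reserve first and only gives the remainder back to the space_info.  A
 * minimal sketch of that arithmetic follows.  It assumes a destination is
 * "full" exactly when reserved >= size, whereas the real code tracks this
 * with a separate ->full flag, and the spill_* names are hypothetical.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a destination reserve; not a kernel type. */
struct spill_rsv {
	uint64_t size;
	uint64_t reserved;
};

/*
 * Hand "excess" bytes to dest until it is full and return the part that
 * did not fit, which the real code gives back to the space_info.
 */
static uint64_t spill_excess(struct spill_rsv *dest, uint64_t excess)
{
	uint64_t room;
	uint64_t moved;

	/* No destination, or nothing missing in it: everything is left over. */
	if (!dest || dest->reserved >= dest->size)
		return excess;

	room = dest->size - dest->reserved;
	moved = excess < room ? excess : room;
	dest->reserved += moved;
	return excess - moved;
}
```

/*
 * For example, releasing 50 bytes into a destination that is missing 30
 * moves 30 bytes across and leaves 20 to be freed.
 */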

int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src,
			    struct btrfs_block_rsv *dst, u64 num_bytes,
			    bool update_size)
{
	int ret;

	ret = btrfs_block_rsv_use_bytes(src, num_bytes);
	if (ret)
		return ret;

	btrfs_block_rsv_add_bytes(dst, num_bytes, update_size);
	return 0;
}

void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, enum btrfs_rsv_type type)
{
	memset(rsv, 0, sizeof(*rsv));
	spin_lock_init(&rsv->lock);
	rsv->type = type;
}

void btrfs_init_metadata_block_rsv(struct btrfs_fs_info *fs_info,
				   struct btrfs_block_rsv *rsv,
				   enum btrfs_rsv_type type)
{
	btrfs_init_block_rsv(rsv, type);
	rsv->space_info = btrfs_find_space_info(fs_info,
						BTRFS_BLOCK_GROUP_METADATA);
}

struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info,
					      enum btrfs_rsv_type type)
{
	struct btrfs_block_rsv *block_rsv;

	block_rsv = kmalloc(sizeof(*block_rsv), GFP_NOFS);
	if (!block_rsv)
		return NULL;

	btrfs_init_metadata_block_rsv(fs_info, block_rsv, type);
	return block_rsv;
}

void btrfs_free_block_rsv(struct btrfs_fs_info *fs_info,
			  struct btrfs_block_rsv *rsv)
{
	if (!rsv)
		return;
	btrfs_block_rsv_release(fs_info, rsv, (u64)-1, NULL);
	kfree(rsv);
}

int btrfs_block_rsv_add(struct btrfs_fs_info *fs_info,
			struct btrfs_block_rsv *block_rsv, u64 num_bytes,
			enum btrfs_reserve_flush_enum flush)
{
	int ret;

	if (num_bytes == 0)
		return 0;

	ret = btrfs_reserve_metadata_bytes(fs_info, block_rsv->space_info,
					   num_bytes, flush);
	if (!ret)
		btrfs_block_rsv_add_bytes(block_rsv, num_bytes, true);

	return ret;
}

int btrfs_block_rsv_check(struct btrfs_block_rsv *block_rsv, int min_percent)
{
	u64 num_bytes = 0;
	int ret = -ENOSPC;

	spin_lock(&block_rsv->lock);
	num_bytes = mult_perc(block_rsv->size, min_percent);
	if (block_rsv->reserved >= num_bytes)
		ret = 0;
	spin_unlock(&block_rsv->lock);

	return ret;
}
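
/*
 * ILLUSTRATION (not part of the kernel source)
 *
 * btrfs_block_rsv_check() succeeds when at least min_percent of ->size is
 * currently reserved.  The sketch below reproduces that test in user space.
 * It assumes mult_perc() computes size * percent / 100 with integer division;
 * that helper is defined elsewhere and is not shown in this file, and
 * pct_check() is a hypothetical name.
 */

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for btrfs_block_rsv_check() without the locking. */
static int pct_check(uint64_t size, uint64_t reserved, unsigned int min_percent)
{
	/* Assumed behaviour of mult_perc(): integer percentage of size. */
	uint64_t needed = size * min_percent / 100;

	return reserved >= needed ? 0 : -ENOSPC;
}
```

/*
 * With size = 1000 and min_percent = 50, the threshold is 500 bytes, so a
 * reserve holding 500 passes and one holding 499 gets -ENOSPC.
 */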

int btrfs_block_rsv_refill(struct btrfs_fs_info *fs_info,
			   struct btrfs_block_rsv *block_rsv, u64 num_bytes,
			   enum btrfs_reserve_flush_enum flush)
{
	int ret = -ENOSPC;

	if (!block_rsv)
		return 0;

	spin_lock(&block_rsv->lock);
	if (block_rsv->reserved >= num_bytes)
		ret = 0;
	else
		num_bytes -= block_rsv->reserved;
	spin_unlock(&block_rsv->lock);

	if (!ret)
		return 0;

	ret = btrfs_reserve_metadata_bytes(fs_info, block_rsv->space_info,
					   num_bytes, flush);
	if (!ret) {
		btrfs_block_rsv_add_bytes(block_rsv, num_bytes, false);
		return 0;
	}

	return ret;
}
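
/*
 * ILLUSTRATION (not part of the kernel source)
 *
 * Unlike btrfs_block_rsv_add(), a refill only reserves the gap between what
 * is already reserved and what the caller asked for, and it leaves ->size
 * untouched.  A minimal sketch of that arithmetic, assuming the underlying
 * reservation always succeeds; the refill_* names are hypothetical.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a block reserve; not a kernel type. */
struct refill_rsv {
	uint64_t size;
	uint64_t reserved;
};

/* Bring ->reserved up to "wanted" and return how many bytes were added. */
static uint64_t refill_to(struct refill_rsv *rsv, uint64_t wanted)
{
	uint64_t shortfall;

	if (rsv->reserved >= wanted)
		return 0;

	shortfall = wanted - rsv->reserved;
	rsv->reserved += shortfall;	/* ->size is deliberately left alone */
	return shortfall;
}
```

/*
 * A reserve holding 30 bytes that is refilled to 100 therefore reserves only
 * the missing 70, and a second identical refill is a no-op.
 */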

u64 btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
			    struct btrfs_block_rsv *block_rsv, u64 num_bytes,
			    u64 *qgroup_to_release)
{
	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_refs_rsv;
	struct btrfs_block_rsv *target = NULL;

	/*
	 * If we are a delayed block reserve then push to the global rsv,
	 * otherwise dump into the global delayed reserve if it is not full.
	 */
	if (block_rsv->type == BTRFS_BLOCK_RSV_DELOPS)
		target = global_rsv;
	else if (block_rsv != global_rsv && !btrfs_block_rsv_full(delayed_rsv))
		target = delayed_rsv;

	if (target && block_rsv->space_info != target->space_info)
		target = NULL;

	return block_rsv_release_bytes(fs_info, block_rsv, target, num_bytes,
				       qgroup_to_release);
}

int btrfs_block_rsv_use_bytes(struct btrfs_block_rsv *block_rsv, u64 num_bytes)
{
	int ret = -ENOSPC;

	spin_lock(&block_rsv->lock);
	if (block_rsv->reserved >= num_bytes) {
		block_rsv->reserved -= num_bytes;
		if (block_rsv->reserved < block_rsv->size)
			block_rsv->full = false;
		ret = 0;
	}
	spin_unlock(&block_rsv->lock);

	return ret;
}

void btrfs_block_rsv_add_bytes(struct btrfs_block_rsv *block_rsv,
			       u64 num_bytes, bool update_size)
{
	spin_lock(&block_rsv->lock);
	block_rsv->reserved += num_bytes;
	if (update_size)
		block_rsv->size += num_bytes;
	else if (block_rsv->reserved >= block_rsv->size)
		block_rsv->full = true;
	spin_unlock(&block_rsv->lock);
}

void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
{
	struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
	struct btrfs_space_info *sinfo = block_rsv->space_info;
	struct btrfs_root *root, *tmp;
	u64 num_bytes = btrfs_root_used(&fs_info->tree_root->root_item);
	unsigned int min_items = 1;

	/*
	 * The global block rsv is based on the size of the extent tree, the
	 * checksum tree and the root tree.  If the fs is empty we want to set
	 * it to a minimal amount for safety.
	 *
	 * We also are going to need to modify the minimum of the tree root and
	 * any global roots we could touch.
	 */
	read_lock(&fs_info->global_root_lock);
	rbtree_postorder_for_each_entry_safe(root, tmp, &fs_info->global_root_tree,
					     rb_node) {
		if (root->root_key.objectid == BTRFS_EXTENT_TREE_OBJECTID ||
		    root->root_key.objectid == BTRFS_CSUM_TREE_OBJECTID ||
		    root->root_key.objectid == BTRFS_FREE_SPACE_TREE_OBJECTID) {
			num_bytes += btrfs_root_used(&root->root_item);
			min_items++;
		}
	}
	read_unlock(&fs_info->global_root_lock);

	if (btrfs_fs_compat_ro(fs_info, BLOCK_GROUP_TREE)) {
		num_bytes += btrfs_root_used(&fs_info->block_group_root->root_item);
		min_items++;
	}

	if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE)) {
		num_bytes += btrfs_root_used(&fs_info->stripe_root->root_item);
		min_items++;
	}

	/*
	 * But we also want to reserve enough space so we can do the fallback
	 * global reserve for an unlink, which is an additional
	 * BTRFS_UNLINK_METADATA_UNITS items.
	 *
	 * But we also need space for the delayed ref updates from the unlink,
	 * so add BTRFS_UNLINK_METADATA_UNITS units for delayed refs, one for
	 * each unlink metadata item.
	 */
	min_items += BTRFS_UNLINK_METADATA_UNITS;

	num_bytes = max_t(u64, num_bytes,
			  btrfs_calc_insert_metadata_size(fs_info, min_items) +
			  btrfs_calc_delayed_ref_bytes(fs_info,
					       BTRFS_UNLINK_METADATA_UNITS));

	spin_lock(&sinfo->lock);
	spin_lock(&block_rsv->lock);

	block_rsv->size = min_t(u64, num_bytes, SZ_512M);

	if (block_rsv->reserved < block_rsv->size) {
		num_bytes = block_rsv->size - block_rsv->reserved;
		btrfs_space_info_update_bytes_may_use(fs_info, sinfo,
						      num_bytes);
		block_rsv->reserved = block_rsv->size;
	} else if (block_rsv->reserved > block_rsv->size) {
		num_bytes = block_rsv->reserved - block_rsv->size;
		btrfs_space_info_update_bytes_may_use(fs_info, sinfo,
						      -num_bytes);
		block_rsv->reserved = block_rsv->size;
		btrfs_try_granting_tickets(fs_info, sinfo);
	}

	block_rsv->full = (block_rsv->reserved == block_rsv->size);

	if (block_rsv->size >= sinfo->total_bytes)
		sinfo->force_alloc = CHUNK_ALLOC_FORCE;
	spin_unlock(&block_rsv->lock);
	spin_unlock(&sinfo->lock);
}
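
/*
 * ILLUSTRATION (not part of the kernel source)
 *
 * The sizing rule in btrfs_update_global_block_rsv() boils down to "take the
 * bytes used by the relevant trees, raise that to a minimum, then cap it at
 * 512M".  The sketch below shows only that clamp.  The minimum is passed in
 * as a plain number because the helpers that compute it are defined
 * elsewhere, and the gsize_* and GSIZE_CAP names are hypothetical.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Same value as SZ_512M in the kernel, spelled out for user space. */
#define GSIZE_CAP (512ULL * 1024 * 1024)

/* Clamp the used-bytes estimate between min_bytes and the 512M cap. */
static uint64_t gsize_target(uint64_t used_bytes, uint64_t min_bytes)
{
	uint64_t target = used_bytes > min_bytes ? used_bytes : min_bytes;

	return target < GSIZE_CAP ? target : GSIZE_CAP;
}

/*
 * Signed adjustment to apply to bytes_may_use so that ->reserved ends up
 * equal to the new ->size: positive to reserve more, negative to give back.
 */
static int64_t gsize_delta(uint64_t reserved, uint64_t new_size)
{
	return (int64_t)new_size - (int64_t)reserved;
}
```

/*
 * A nearly empty filesystem is therefore governed by the minimum, while a
 * very large one is governed by the cap.
 */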

void btrfs_init_root_block_rsv(struct btrfs_root *root)
{
	struct btrfs_fs_info *fs_info = root->fs_info;

	switch (root->root_key.objectid) {
	case BTRFS_CSUM_TREE_OBJECTID:
	case BTRFS_EXTENT_TREE_OBJECTID:
	case BTRFS_FREE_SPACE_TREE_OBJECTID:
	case BTRFS_BLOCK_GROUP_TREE_OBJECTID:
	case BTRFS_RAID_STRIPE_TREE_OBJECTID:
		root->block_rsv = &fs_info->delayed_refs_rsv;
		break;
	case BTRFS_ROOT_TREE_OBJECTID:
	case BTRFS_DEV_TREE_OBJECTID:
	case BTRFS_QUOTA_TREE_OBJECTID:
		root->block_rsv = &fs_info->global_block_rsv;
		break;
	case BTRFS_CHUNK_TREE_OBJECTID:
		root->block_rsv = &fs_info->chunk_block_rsv;
		break;
	default:
		root->block_rsv = NULL;
		break;
	}
}

/*
 * Attach each filesystem-wide block reserve to its space_info (SYSTEM for the
 * chunk reserve, METADATA for the rest), then size the global reserve.
 */
void btrfs_init_global_block_rsv(struct btrfs_fs_info *fs_info)
{
	struct btrfs_space_info *space_info;

	space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_SYSTEM);
	fs_info->chunk_block_rsv.space_info = space_info;

	space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
	fs_info->global_block_rsv.space_info = space_info;
	fs_info->trans_block_rsv.space_info = space_info;
	fs_info->empty_block_rsv.space_info = space_info;
	fs_info->delayed_block_rsv.space_info = space_info;
	fs_info->delayed_refs_rsv.space_info = space_info;

	btrfs_update_global_block_rsv(fs_info);
}

/*
 * Release the entire global reserve. Every other reserve is expected to be
 * empty already; warn if any of them still has a size or reserved bytes.
 */
void btrfs_release_global_block_rsv(struct btrfs_fs_info *fs_info)
{
	btrfs_block_rsv_release(fs_info, &fs_info->global_block_rsv, (u64)-1,
				NULL);
	WARN_ON(fs_info->trans_block_rsv.size > 0);
	WARN_ON(fs_info->trans_block_rsv.reserved > 0);
	WARN_ON(fs_info->chunk_block_rsv.size > 0);
	WARN_ON(fs_info->chunk_block_rsv.reserved > 0);
	WARN_ON(fs_info->delayed_block_rsv.size > 0);
	WARN_ON(fs_info->delayed_block_rsv.reserved > 0);
	WARN_ON(fs_info->delayed_refs_rsv.reserved > 0);
	WARN_ON(fs_info->delayed_refs_rsv.size > 0);
}

/*
 * Pick the block reserve for an allocation in @root. Shareable roots, the UUID
 * root, and the csum root while adding csums use the transaction's reserve.
 * Otherwise use the root's own reserve, falling back to the empty reserve.
 */
static struct btrfs_block_rsv *get_block_rsv(
					const struct btrfs_trans_handle *trans,
					const struct btrfs_root *root)
{
	struct btrfs_fs_info *fs_info = root->fs_info;
	struct btrfs_block_rsv *block_rsv = NULL;

	if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) ||
	    (root == fs_info->uuid_root) ||
	    (trans->adding_csums &&
	     root->root_key.objectid == BTRFS_CSUM_TREE_OBJECTID))
		block_rsv = trans->block_rsv;

	if (!block_rsv)
		block_rsv = root->block_rsv;

	if (!block_rsv)
		block_rsv = &fs_info->empty_block_rsv;

	return block_rsv;
}

struct btrfs_block_rsv *btrfs_use_block_rsv(struct btrfs_trans_handle *trans,
					    struct btrfs_root *root,
					    u32 blocksize)
{
	struct btrfs_fs_info *fs_info = root->fs_info;
	struct btrfs_block_rsv *block_rsv;
	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
	int ret;
	bool global_updated = false;

	block_rsv = get_block_rsv(trans, root);

	if (unlikely(block_rsv->size == 0))
		goto try_reserve;
again:
	ret = btrfs_block_rsv_use_bytes(block_rsv, blocksize);
	if (!ret)
		return block_rsv;

	if (block_rsv->failfast)
		return ERR_PTR(ret);

	if (block_rsv->type == BTRFS_BLOCK_RSV_GLOBAL && !global_updated) {
		global_updated = true;
		btrfs_update_global_block_rsv(fs_info);
		goto again;
	}

	/*
	 * The global reserve still exists to save us from ourselves, so don't
	 * warn_on if we are short on our delayed refs reserve.
	 */
	if (block_rsv->type != BTRFS_BLOCK_RSV_DELREFS &&
	    btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
		static DEFINE_RATELIMIT_STATE(_rs,
				DEFAULT_RATELIMIT_INTERVAL * 10,
				/*DEFAULT_RATELIMIT_BURST*/ 1);
		if (__ratelimit(&_rs))
			WARN(1, KERN_DEBUG
				"BTRFS: block rsv %d returned %d\n",
				block_rsv->type, ret);
	}
try_reserve:
	ret = btrfs_reserve_metadata_bytes(fs_info, block_rsv->space_info,
					   blocksize, BTRFS_RESERVE_NO_FLUSH);
	if (!ret)
		return block_rsv;
	/*
	 * If we couldn't reserve metadata bytes try and use some from
	 * the global reserve if its space type is the same as the global
	 * reservation.
	 */
	if (block_rsv->type != BTRFS_BLOCK_RSV_GLOBAL &&
	    block_rsv->space_info == global_rsv->space_info) {
		ret = btrfs_block_rsv_use_bytes(global_rsv, blocksize);
		if (!ret)
			return global_rsv;
	}
btrfs: introduce BTRFS_RESERVE_FLUSH_EMERGENCY
Inside of FB, as well as some user reports, we've had a consistent
problem of occasional ENOSPC transaction aborts. Inside FB we were
seeing ~100-200 ENOSPC aborts per day in the fleet, which is a really
low occurrence rate given the size of our fleet, but it's not nothing.
There are two causes of this particular problem.
First is delayed allocation. The reservation system for delalloc
assumes that contiguous dirty ranges will result in 1 file extent item.
However if there is memory pressure that results in fragmented writeout,
or there is fragmentation in the block groups, this won't necessarily be
true. Consider the case where we do a single 256MiB write to a file and
then close it. We will have 1 reservation for the inode update, the
reservations for the checksum updates, and 1 reservation for the file
extent item. At some point later we decide to write this entire range
out, but we're so fragmented that we break this into 100 different file
extents. Since we've already closed the file and are no longer writing
to it there's nothing to trigger a refill of the delalloc block rsv to
satisfy the 99 new file extent reservations we need. At this point we
exhaust our delalloc reservation, and we begin to steal from the global
reserve. If you have enough of these cases going in parallel you can
easily exhaust the global reserve, get an ENOSPC at
btrfs_alloc_tree_block() time, and then abort the transaction.
The other case is the delayed refs reserve. The delayed refs reserve
updates its size based on outstanding delayed refs and dirty block
groups. However we only refill this block reserve when returning
excess reservations and when we call btrfs_start_transaction(root, X).
We will reserve 2*X credits at transaction start time, and fill in X
into the delayed refs reserve to make sure it stays topped off.
Generally this works well, but clearly has downsides. If we do a
particularly delayed ref heavy operation we may never catch up in our
reservations. Additionally running delayed refs generates more delayed
refs, and at that point we may be committing the transaction and have no
way to trigger a refill of our delayed refs rsv. Then a similar thing
occurs with the delalloc reserve.
Generally speaking we well over-reserve in all of our block rsvs. If we
reserve 1 credit we're usually reserving around 264k of space, but we'll
often not use any of that reservation, or use a few blocks of that
reservation. We can be reasonably sure that as long as you were able to
reserve space up front for your operation you'll be able to find space
on disk for that reservation.
So introduce a new flushing state, BTRFS_RESERVE_FLUSH_EMERGENCY. This
gets used in the case that we've exhausted our reserve and the global
reserve. It simply forces a reservation if we have enough actual space
on disk to make the reservation, which is almost always the case. This
keeps us from hitting ENOSPC aborts in these odd occurrences where we've
not kept up with the delayed work.
Fixing this in a complete way is going to be relatively complicated and
time consuming. This patch is what I discussed with Filipe earlier this
year, and what I put into our kernels inside FB. With this patch we're
down to 1-2 ENOSPC aborts per week, which is a significant reduction.
This is a decent stop gap until we can work out a more wholistic
solution to these two corner cases.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-09 16:35:01 +03:00
	/*
	 * All hope is lost, but of course our reservations are overly
	 * pessimistic, so instead of possibly having an ENOSPC abort here, try
	 * one last time to force a reservation if there's enough actual space
	 * on disk to make the reservation.
	 */
	ret = btrfs_reserve_metadata_bytes(fs_info, block_rsv->space_info, blocksize,
					   BTRFS_RESERVE_FLUSH_EMERGENCY);
	if (!ret)
		return block_rsv;

	return ERR_PTR(ret);
}

int btrfs_check_trunc_cache_free_space(struct btrfs_fs_info *fs_info,
				       struct btrfs_block_rsv *rsv)
{
	u64 needed_bytes;
	int ret;

	/* 1 for slack space, 1 for updating the inode */
	needed_bytes = btrfs_calc_insert_metadata_size(fs_info, 1) +
		btrfs_calc_metadata_size(fs_info, 1);

	spin_lock(&rsv->lock);
	if (rsv->reserved < needed_bytes)
		ret = -ENOSPC;
	else
		ret = 0;
	spin_unlock(&rsv->lock);
	return ret;
}