2018-04-03 19:16:55 +02:00
/* SPDX-License-Identifier: GPL-2.0 */
2014-05-13 17:30:47 -07:00
/*
 * Copyright (C) 2014 Facebook.  All rights reserved.
*/
2018-04-03 19:16:55 +02:00
# ifndef BTRFS_QGROUP_H
# define BTRFS_QGROUP_H
2014-05-13 17:30:47 -07:00
btrfs: qgroup: Introduce per-root swapped blocks infrastructure
To allow delayed subtree swap rescan, btrfs needs to record per-root
information about which tree blocks get swapped. This patch introduces
the required infrastructure.
The designed workflow will be:
1) Record the subtree root block that gets swapped.
During subtree swap:
O = Old tree blocks
N = New tree blocks
reloc tree subvolume tree X
Root Root
/ \ / \
NA OB OA OB
/ | | \ / | | \
NC ND OE OF OC OD OE OF
In this case, NA and OA are going to be swapped, record (NA, OA) into
subvolume tree X.
2) After subtree swap.
reloc tree subvolume tree X
Root Root
/ \ / \
OA OB NA OB
/ | | \ / | | \
OC OD OE OF NC ND OE OF
3a) COW happens for OB
If we are going to COW tree block OB, we check OB's bytenr against
tree X's swapped_blocks structure.
If it doesn't fit any, nothing will happen.
3b) COW happens for NA
Check NA's bytenr against tree X's swapped_blocks, and get a hit.
Then we do subtree scan on both subtrees OA and NA.
Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
Then no matter what we do to subvolume tree X, qgroup numbers will
still be correct.
Then NA's record gets removed from X's swapped_blocks.
4) Transaction commit
Any record in X's swapped_blocks gets removed, since there is no
modification to swapped subtrees, no need to trigger heavy qgroup
subtree rescan for them.
This will introduce 128 bytes overhead for each btrfs_root even qgroup
is not enabled. This is to reduce memory allocations and potential
failures.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-23 15:15:16 +08:00
# include <linux/spinlock.h>
# include <linux/rbtree.h>
2015-04-16 14:34:17 +08:00
# include "ulist.h"
# include "delayed-ref.h"
2016-10-18 09:31:26 +08:00
/*
 * Btrfs qgroup overview
 *
 * Btrfs qgroup is split into 3 main parts:
 * 1) Reserve
 *    Reserve metadata/data space for incoming operations.
 *    Affects how the qgroup limit works.
 *
 * 2) Trace
 *    Tell btrfs qgroup to trace dirty extents.
 *
 *    Dirty extents include:
 *    - Newly allocated extents
 *    - Extents going to be deleted (in this trans)
 *    - Extents whose owner is going to be modified
 *
 *    This is the main part that affects whether qgroup numbers will stay
 *    consistent.
 *    Btrfs qgroup can trace clean extents without causing any problem,
 *    but doing so consumes extra CPU time and should be avoided if possible.
 *
 * 3) Account
 *    Btrfs qgroup will update its numbers, based on the dirty extents traced
 *    in the previous step.
 *
 *    Normally done at qgroup rescan and transaction commit time.
 */
2019-01-23 15:15:16 +08:00
/*
 * Special performance optimization for balance.
 *
 * For balance, we need to swap subtree of subvolume and reloc trees.
 * In theory, we need to trace all subtree blocks of both subvolume and reloc
 * trees, since their owner has changed during such swap.
 *
 * However since balance has ensured that both subtrees are containing the
 * same contents and have the same tree structures, such swap won't cause
 * qgroup number change.
 *
 * But there is a race window between subtree swap and transaction commit,
 * during that window, if we increase/decrease tree level or merge/split tree
 * blocks, we still need to trace the original subtrees.
 *
 * So for balance, we use a delayed subtree tracing, whose workflow is:
 *
 * 1) Record the subtree root block that gets swapped.
 *
 *    During subtree swap:
 *    O = Old tree blocks
 *    N = New tree blocks
 *          reloc tree                     subvolume tree X
 *             Root                               Root
 *            /    \                             /    \
 *          NA     OB                          OA      OB
 *        /  |     |  \                      /  |      |  \
 *      NC  ND     OE  OF                   OC  OD     OE  OF
 *
 *   In this case, NA and OA are going to be swapped, record (NA, OA) into
 *   subvolume tree X.
 *
 * 2) After subtree swap.
 *          reloc tree                     subvolume tree X
 *             Root                               Root
 *            /    \                             /    \
 *          OA     OB                          NA      OB
 *        /  |     |  \                      /  |      |  \
 *      OC  OD     OE  OF                   NC  ND     OE  OF
 *
 * 3a) COW happens for OB
 *     If we are going to COW tree block OB, we check OB's bytenr against
 *     tree X's swapped_blocks structure.
 *     If it doesn't match any record, nothing will happen.
 *
 * 3b) COW happens for NA
 *     Check NA's bytenr against tree X's swapped_blocks, and get a hit.
 *     Then we do subtree scan on both subtrees OA and NA.
 *     Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
 *
 *     Then no matter what we do to subvolume tree X, qgroup numbers will
 *     still be correct.
 *     Then NA's record gets removed from X's swapped_blocks.
 *
 * 4) Transaction commit
 *    Any record in X's swapped_blocks gets removed, since there is no
 *    modification to the swapped subtrees, no need to trigger heavy qgroup
 *    subtree rescan for them.
 */
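The delayed subtree tracing workflow described in the comment above can be sketched in user-space C. This is only an illustration under stated assumptions: `struct swapped_block`, `struct swapped_blocks`, `record_swap()`, `cow_check()`, `commit_clear()` and the fixed-size array are hypothetical stand-ins (the kernel side keeps records in rbtrees and performs real subtree scans), and the "scan" is reduced to a counter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SWAPPED 16	/* hypothetical fixed capacity, for illustration only */

/* Hypothetical, simplified stand-in for struct btrfs_qgroup_swapped_block. */
struct swapped_block {
	uint64_t subvol_bytenr;	/* block now living in the subvolume tree (NA) */
	uint64_t reloc_bytenr;	/* block now living in the reloc tree (OA) */
	bool in_use;
};

/* Hypothetical per-root container; a flat array replaces the rbtree. */
struct swapped_blocks {
	struct swapped_block slots[MAX_SWAPPED];
	unsigned int subtrees_scanned;
};

/* Step 1: record the (NA, OA) pair at subtree swap time. False if full. */
static bool record_swap(struct swapped_blocks *sb, uint64_t subvol_bytenr,
			uint64_t reloc_bytenr)
{
	for (size_t i = 0; i < MAX_SWAPPED; i++) {
		if (!sb->slots[i].in_use) {
			sb->slots[i] = (struct swapped_block){
				.subvol_bytenr = subvol_bytenr,
				.reloc_bytenr = reloc_bytenr,
				.in_use = true,
			};
			return true;
		}
	}
	return false;
}

/*
 * Step 3: called before COWing a tree block of the subvolume tree.
 * Miss (3a): nothing happens. Hit (3b): both subtrees would be scanned
 * (only counted here) and the record is dropped. Returns true on a hit.
 */
static bool cow_check(struct swapped_blocks *sb, uint64_t bytenr)
{
	for (size_t i = 0; i < MAX_SWAPPED; i++) {
		if (sb->slots[i].in_use &&
		    sb->slots[i].subvol_bytenr == bytenr) {
			sb->subtrees_scanned += 2;	/* subtree OA + subtree NA */
			sb->slots[i].in_use = false;
			return true;
		}
	}
	return false;
}

/* Step 4: transaction commit drops every remaining record unscanned. */
static size_t commit_clear(struct swapped_blocks *sb)
{
	size_t dropped = 0;

	for (size_t i = 0; i < MAX_SWAPPED; i++) {
		if (sb->slots[i].in_use) {
			sb->slots[i].in_use = false;
			dropped++;
		}
	}
	return dropped;
}
```

The point of the design is that the expensive scan is paid only for swapped subtrees that are actually modified before commit; untouched records are discarded for free in `commit_clear()`.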
2015-04-16 14:34:17 +08:00
/*
 * Record a dirty extent, and inform qgroup to update quota on it
 * TODO: Use kmem cache to alloc it.
*/
struct btrfs_qgroup_extent_record {
struct rb_node node ;
u64 bytenr ;
u64 num_bytes ;
btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record
[BUG]
Btrfs/139 will fail with a high probability if the testing machine (VM)
has only 2G RAM.
As a result, the final write succeeds while it should fail with EDQUOT,
and the fs will have quota exceeding the limit by 16K.
The simplified reproducer will be: (needs a 2G ram VM)
$ mkfs.btrfs -f $dev
$ mount $dev $mnt
$ btrfs subv create $mnt/subv
$ btrfs quota enable $mnt
$ btrfs quota rescan -w $mnt
$ btrfs qgroup limit -e 1G $mnt/subv
$ for i in $(seq -w 1 8); do
xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null
echo "file $i written" > /dev/kmsg
done
$ sync
$ btrfs qgroup show -pcre --raw $mnt
The last pwrite will not trigger EDQUOT and final 'qgroup show' will
show something like:
qgroupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 16384 16384 none none --- ---
0/256 1073758208 1073758208 none 1073741824 --- ---
And 1073758208 is larger than the limit of 1073741824.
[CAUSE]
It's a bug in btrfs qgroup data reserved space management.
For quota limit, we must ensure that:
reserved (data + metadata) + rfer/excl <= limit
Since rfer/excl is only updated at transaction commit time, reserved
space needs special care.
One important part of reserved space is data, and for a new data extent
written to disk, we still need to take the reserved space until
rfer/excl numbers get updated.
Originally when an ordered extent finishes, we migrate the reserved
qgroup data space from extent_io tree to delayed ref head of the data
extent, expecting delayed ref will only be cleaned up at commit
transaction time.
However for small RAM machine, due to memory pressure dirty pages can be
flushed back to disk without committing a transaction.
The related events will be something like:
file 1 written
btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840
btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096
btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344
btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192
btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344
cleanup_ref_head: num_bytes=54947840
cleanup_ref_head: num_bytes=5636096
cleanup_ref_head: num_bytes=569344
cleanup_ref_head: num_bytes=57344
cleanup_ref_head: num_bytes=8192
^^^^^^^^^^^^^^^^ This will free qgroup data reserved space
file 2 written
...
file 8 written
cleanup_ref_head: num_bytes=8192
...
btrfs_commit_transaction <<< the only transaction committed during
the test
When file 2 is written, we have already freed 128M reserved qgroup data
space for ino 258. Thus later write won't trigger EDQUOT.
This allows us to write more data beyond qgroup limit.
In my 2G ram VM, it could reach about 1.2G before hitting EDQUOT.
[FIX]
By moving reserved qgroup data space from btrfs_delayed_ref_head to
btrfs_qgroup_extent_record, we can ensure that reserved qgroup data
space won't be freed half way before commit transaction, thus fix the
problem.
Fixes: f64d5ca86821 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-23 15:15:12 +08:00
/*
 * For qgroup reserved data space freeing.
 *
 * @data_rsv_refroot and @data_rsv will be recorded after
 * BTRFS_ADD_DELAYED_EXTENT is called.
 * And will be used to free reserved qgroup space at
 * transaction commit time.
*/
u32 data_rsv ; /* reserved data space needs to be freed */
u64 data_rsv_refroot ; /* which root the reserved data belongs to */
2015-04-16 14:34:17 +08:00
struct ulist * old_roots ;
} ;
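The invariant from the commit message above, `reserved (data + metadata) + rfer/excl <= limit`, can be illustrated with a small user-space model. Everything here is a hypothetical simplification: `struct toy_qgroup`, the `toy_*` helpers, and the assumption that the subvolume starts with 16384 bytes of exclusive usage are for illustration only and are not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified qgroup state; not the kernel's struct. */
struct toy_qgroup {
	uint64_t excl;		/* committed exclusive bytes */
	uint64_t reserved;	/* data bytes reserved, not yet committed */
	uint64_t written;	/* bytes on disk waiting for the next commit */
	uint64_t max_excl;	/* exclusive limit */
};

/* Reserve @len bytes; fails (EDQUOT in the kernel) past the limit. */
static bool toy_reserve(struct toy_qgroup *qg, uint64_t len)
{
	if (qg->reserved + qg->excl + len > qg->max_excl)
		return false;
	qg->reserved += len;
	return true;
}

/* Commit: excl catches up with what was written, reservations drop. */
static void toy_commit(struct toy_qgroup *qg)
{
	qg->excl += qg->written;
	qg->written = 0;
	qg->reserved = 0;
}

/*
 * Write @nr files of @len bytes each without committing in between.
 * With @early_free set, the reservation is dropped right after the
 * write, modelling the bug where it was freed before commit time.
 * Returns how many writes succeeded.
 */
static unsigned int toy_write_files(struct toy_qgroup *qg, unsigned int nr,
				    uint64_t len, bool early_free)
{
	unsigned int ok = 0;

	for (unsigned int i = 0; i < nr; i++) {
		if (!toy_reserve(qg, len))
			break;
		qg->written += len;
		if (early_free)
			qg->reserved -= len;
		ok++;
	}
	return ok;
}
```

With a 1 GiB limit and eight 128 MiB writes, keeping the reservation until commit rejects the eighth write, while freeing it early lets all eight through and leaves `excl` at 1073758208 after commit, the figure quoted in the commit message.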
2019-01-23 15:15:16 +08:00
struct btrfs_qgroup_swapped_block {
struct rb_node node ;
int level ;
bool trace_leaf ;
/* bytenr/generation of the tree block in subvolume tree after swap */
u64 subvol_bytenr ;
u64 subvol_generation ;
/* bytenr/generation of the tree block in reloc tree after swap */
u64 reloc_bytenr ;
u64 reloc_generation ;
u64 last_snapshot ;
struct btrfs_key first_key ;
} ;
btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans
Btrfs uses 2 different methods to reserve metadata qgroup space.
1) Reserve at btrfs_start_transaction() time
This is quite straightforward, caller will use the trans handler
allocated to modify b-trees.
In this case, reserved metadata should be kept until qgroup numbers
are updated.
2) Reserve by using block_rsv first, and later btrfs_join_transaction()
This is more complicated, caller will reserve space using block_rsv
first, and then later call btrfs_join_transaction() to get a trans
handle.
In this case, before we modify trees, the reserved space can be
modified on demand, and after btrfs_join_transaction(), such reserved
space should also be kept until qgroup numbers are updated.
Since these two types behave differently, split the original "META"
reservation type into 2 sub-types:
META_PERTRANS:
For above case 1)
META_PREALLOC:
For reservations that happened before btrfs_join_transaction() of
case 2)
NOTE: This patch will only convert existing qgroup meta reservation
callers according to its situation, not ensuring all callers are at
correct timing.
Such fix will be added in later patches.
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-12-12 15:34:29 +08:00
/*
 * Qgroup reservation types:
 *
 * DATA:
 *	space reserved for data
 *
 * META_PERTRANS:
 *	Space reserved for metadata (per-transaction)
 *	Due to the fact that qgroup data is only updated at transaction commit
 *	time, reserved space for metadata must be kept until transaction
 *	commits.
 *	Any metadata reservation that is used in btrfs_start_transaction()
 *	should be of this type.
 *
 * META_PREALLOC:
 *	There are cases where metadata space is reserved before starting
 *	transaction, and then btrfs_join_transaction() to get a trans handle.
 *	Any metadata reserved for such usage should be of this type.
 *	And after join_transaction() part (or all) of such reservation should
 *	be converted into META_PERTRANS.
 */
2017-12-12 15:34:23 +08:00
enum btrfs_qgroup_rsv_type {
2018-11-27 15:25:13 +01:00
BTRFS_QGROUP_RSV_DATA ,
2017-12-12 15:34:29 +08:00
BTRFS_QGROUP_RSV_META_PERTRANS ,
BTRFS_QGROUP_RSV_META_PREALLOC ,
2017-12-12 15:34:23 +08:00
BTRFS_QGROUP_RSV_LAST ,
} ;
/*
 * Represents how many bytes we have reserved for this qgroup.
 *
 * Each type should have different reservation behavior.
 * E.g., data follows its io_tree flag modification, while
2018-11-28 12:05:13 +01:00
 * *currently* meta is just reserve-and-clear during transaction.
2017-12-12 15:34:23 +08:00
 *
 * TODO: Add new type for reservation which can survive transaction commit.
2018-11-28 12:05:13 +01:00
 * Current metadata reservation behavior is not suitable for such case.
2017-12-12 15:34:23 +08:00
*/
struct btrfs_qgroup_rsv {
u64 values [ BTRFS_QGROUP_RSV_LAST ] ;
} ;
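A per-type reservation counter like `struct btrfs_qgroup_rsv` can be exercised with the following user-space sketch. The names `rsv_add()`, `rsv_release()`, `rsv_convert_prealloc()` and `rsv_total()` are hypothetical, and clamping a release to the amount actually reserved is an assumption made here to keep the toy free of underflow; it is not a statement about what the kernel functions do.

```c
#include <stdint.h>

/* Mirrors the shape of enum btrfs_qgroup_rsv_type; names are local. */
enum rsv_type {
	RSV_DATA,
	RSV_META_PERTRANS,
	RSV_META_PREALLOC,
	RSV_LAST,
};

/* One counter per reservation type, indexed by enum rsv_type. */
struct rsv {
	uint64_t values[RSV_LAST];
};

static void rsv_add(struct rsv *r, enum rsv_type type, uint64_t bytes)
{
	r->values[type] += bytes;
}

/* Release up to @bytes of @type; returns how much was really released. */
static uint64_t rsv_release(struct rsv *r, enum rsv_type type, uint64_t bytes)
{
	if (bytes > r->values[type])
		bytes = r->values[type];	/* assumed clamp, see lead-in */
	r->values[type] -= bytes;
	return bytes;
}

/*
 * After joining a transaction, move part (or all) of the preallocated
 * metadata reservation to the per-transaction type.
 */
static uint64_t rsv_convert_prealloc(struct rsv *r, uint64_t bytes)
{
	uint64_t moved = rsv_release(r, RSV_META_PREALLOC, bytes);

	rsv_add(r, RSV_META_PERTRANS, moved);
	return moved;
}

/* Sum over all types, i.e. the amount counted against the limit. */
static uint64_t rsv_total(const struct rsv *r)
{
	uint64_t sum = 0;

	for (int i = 0; i < RSV_LAST; i++)
		sum += r->values[i];
	return sum;
}
```

Converting between the two metadata types leaves `rsv_total()` unchanged, which matches the idea that conversion changes when the space is released, not how much is held.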
2017-03-13 15:52:08 +08:00
/*
 * One struct for each qgroup, organized in fs_info->qgroup_tree.
*/
struct btrfs_qgroup {
u64 qgroupid ;
/*
* state
*/
u64 rfer ; /* referenced */
u64 rfer_cmpr ; /* referenced compressed */
u64 excl ; /* exclusive */
u64 excl_cmpr ; /* exclusive compressed */
/*
* limits
*/
u64 lim_flags ; /* which limits are set */
u64 max_rfer ;
u64 max_excl ;
u64 rsv_rfer ;
u64 rsv_excl ;
/*
* reservation tracking
*/
2017-12-12 15:34:23 +08:00
struct btrfs_qgroup_rsv rsv ;
2017-03-13 15:52:08 +08:00
/*
* lists
*/
struct list_head groups ; /* groups this group is member of */
struct list_head members ; /* groups that are members of this group */
struct list_head dirty ; /* dirty groups */
struct rb_node node ; /* tree of qgroups */
/*
 * temp variables for accounting operations
 * Refer to qgroup_shared_accounting() for details.
*/
u64 old_refcnt ;
u64 new_refcnt ;
} ;
2015-09-28 16:57:53 +08:00
/*
* For qgroup event trace points only
*/
# define QGROUP_RESERVE (1<<0)
# define QGROUP_RELEASE (1<<1)
# define QGROUP_FREE (1<<2)
2018-07-05 14:50:48 +03:00
int btrfs_quota_enable ( struct btrfs_fs_info * fs_info ) ;
int btrfs_quota_disable ( struct btrfs_fs_info * fs_info ) ;
2014-05-13 17:30:47 -07:00
int btrfs_qgroup_rescan ( struct btrfs_fs_info * fs_info ) ;
void btrfs_qgroup_rescan_resume ( struct btrfs_fs_info * fs_info ) ;
2016-08-08 22:08:06 -04:00
int btrfs_qgroup_wait_for_completion ( struct btrfs_fs_info * fs_info ,
bool interruptible ) ;
2018-07-18 14:45:30 +08:00
int btrfs_add_qgroup_relation ( struct btrfs_trans_handle * trans , u64 src ,
u64 dst ) ;
2018-07-18 14:45:32 +08:00
int btrfs_del_qgroup_relation ( struct btrfs_trans_handle * trans , u64 src ,
u64 dst ) ;
2018-07-18 14:45:33 +08:00
int btrfs_create_qgroup ( struct btrfs_trans_handle * trans , u64 qgroupid ) ;
2018-07-18 14:45:34 +08:00
int btrfs_remove_qgroup ( struct btrfs_trans_handle * trans , u64 qgroupid ) ;
2018-07-18 14:45:35 +08:00
int btrfs_limit_qgroup ( struct btrfs_trans_handle * trans , u64 qgroupid ,
2014-05-13 17:30:47 -07:00
struct btrfs_qgroup_limit * limit ) ;
int btrfs_read_qgroup_config ( struct btrfs_fs_info * fs_info ) ;
void btrfs_free_qgroup_config ( struct btrfs_fs_info * fs_info ) ;
struct btrfs_delayed_extent_op ;
2017-02-27 15:10:35 +08:00
btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent()
Refactor btrfs_qgroup_insert_dirty_extent() function, to two functions:
1. btrfs_qgroup_insert_dirty_extent_nolock()
Almost the same with original code.
For delayed_ref usage, which has delayed refs locked.
Change the return value type to int, since caller never needs the
pointer, but only needs to know if they need to free the allocated
memory.
2. btrfs_qgroup_insert_dirty_extent()
The more encapsulated version.
Will do the delayed_refs lock, memory allocation, quota enabled check
and other things.
The original design is to keep exported functions to minimal, but since
more btrfs hacks exposed, like replacing path in balance, we need to
record dirty extents manually, so we have to add such functions.
Also, add comment for both functions, to info developers how to keep
qgroup correct when doing hacks.
Cc: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-15 10:36:50 +08:00
/*
2016-10-18 09:31:27 +08:00
 * Inform qgroup to trace one dirty extent, its info is recorded in @record.
2017-02-15 10:43:03 +08:00
 * So qgroup can account it at transaction committing time.
2016-08-15 10:36:50 +08:00
*
2017-02-15 10:43:03 +08:00
 * No lock version, caller must acquire delayed ref lock and allocate memory,
 * then call btrfs_qgroup_trace_extent_post() after exiting lock context.
2016-08-15 10:36:50 +08:00
*
 * Return 0 for successful insert
 * Return >0 for existing record, caller can free @record safely.
 * Error is not possible
*/
2016-10-18 09:31:27 +08:00
int btrfs_qgroup_trace_extent_nolock (
2016-08-15 10:36:50 +08:00
struct btrfs_fs_info * fs_info ,
struct btrfs_delayed_ref_root * delayed_refs ,
struct btrfs_qgroup_extent_record * record ) ;
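The return-value contract documented for the no-lock variant (0 means the record was inserted, >0 means a record already exists and the caller may free its own) suggests a caller pattern along these lines. This is a user-space sketch with hypothetical `toy_*` names and a linked list in place of the rbtree; locking is only marked by a comment, and returning 1 from the wrapper for a duplicate is a choice made for the demonstration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct btrfs_qgroup_extent_record. */
struct toy_record {
	uint64_t bytenr;
	uint64_t num_bytes;
	struct toy_record *next;
};

/* Hypothetical stand-in for the dirty extent container. */
struct toy_dirty_extents {
	struct toy_record *head;
};

/*
 * Returns 0 when @record was inserted (the container now owns it) and 1
 * when a record with the same bytenr exists (the caller still owns it).
 */
static int toy_trace_extent_nolock(struct toy_dirty_extents *de,
				   struct toy_record *record)
{
	for (struct toy_record *cur = de->head; cur; cur = cur->next) {
		if (cur->bytenr == record->bytenr)
			return 1;
	}
	record->next = de->head;
	de->head = record;
	return 0;
}

/* Caller pattern: allocate first, insert under the lock, free on >0. */
static int toy_trace_extent(struct toy_dirty_extents *de, uint64_t bytenr,
			    uint64_t num_bytes)
{
	struct toy_record *record = malloc(sizeof(*record));
	int ret;

	if (!record)
		return -1;
	record->bytenr = bytenr;
	record->num_bytes = num_bytes;
	record->next = NULL;

	/* The kernel caller would hold the delayed refs lock around this. */
	ret = toy_trace_extent_nolock(de, record);
	if (ret > 0)
		free(record);	/* duplicate: our copy was not consumed */
	return ret;
}

static size_t toy_count(const struct toy_dirty_extents *de)
{
	size_t n = 0;

	for (const struct toy_record *cur = de->head; cur; cur = cur->next)
		n++;
	return n;
}

static void toy_free_all(struct toy_dirty_extents *de)
{
	while (de->head) {
		struct toy_record *next = de->head->next;

		free(de->head);
		de->head = next;
	}
}
```

Allocating before taking the lock keeps the critical section free of anything that can sleep, which is why the insert function itself can promise that no error is possible.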
2017-02-15 10:43:03 +08:00
/*
 * Post handler after qgroup_trace_extent_nolock().
 *
 * NOTE: Current qgroup does the expensive backref walk at transaction
 * committing time with TRANS_STATE_COMMIT_DOING, this blocks incoming
 * new transaction.
 * This is designed to allow btrfs_find_all_roots() to get correct new_roots
 * result.
 *
 * However for old_roots there is no need to do backref walk at that time,
 * since we search commit roots to walk backref and result will always be
 * correct.
 *
 * Due to the nature of no lock version, we can't do backref there.
 * So we must call btrfs_qgroup_trace_extent_post() after exiting
 * spinlock context.
 *
 * TODO: If we can fix and prove btrfs_find_all_roots() can get correct result
 * using current root, then we can move all expensive backref walk out of
 * transaction committing, but not now as qgroup accounting will be wrong again.
 */
int btrfs_qgroup_trace_extent_post ( struct btrfs_fs_info * fs_info ,
struct btrfs_qgroup_extent_record * qrecord ) ;
2016-08-15 10:36:50 +08:00
/*
2016-10-18 09:31:27 +08:00
 * Inform qgroup to trace one dirty extent, specified by @bytenr and
 * @num_bytes.
 * So qgroup can account it at commit trans time.
2016-08-15 10:36:50 +08:00
*
2017-02-15 10:43:03 +08:00
 * Better encapsulated version, with memory allocation and backref walk for
 * commit roots.
 * So this can sleep.
2016-08-15 10:36:50 +08:00
*
 * Return 0 if the operation is done.
 * Return <0 for error, like memory allocation failure or invalid parameter
 * (NULL trans)
*/
2018-07-18 16:28:03 +08:00
int btrfs_qgroup_trace_extent ( struct btrfs_trans_handle * trans , u64 bytenr ,
u64 num_bytes , gfp_t gfp_flag ) ;
2016-08-15 10:36:50 +08:00
2016-10-18 09:31:28 +08:00
/*
 * Inform qgroup to trace all leaf items of data
 *
 * Return 0 for success
 * Return <0 for error (ENOMEM)
 */
int btrfs_qgroup_trace_leaf_items ( struct btrfs_trans_handle * trans ,
struct extent_buffer * eb ) ;
/*
 * Inform qgroup to trace a whole subtree, including all its child tree
 * blocks and data.
 * The root tree block is specified by @root_eb.
 *
 * Normally used by relocation (tree block swap) and subvolume deletion.
 *
 * Return 0 for success
 * Return <0 for error (ENOMEM or tree search error)
 */
int btrfs_qgroup_trace_subtree ( struct btrfs_trans_handle * trans ,
struct extent_buffer * root_eb ,
u64 root_gen , int root_level ) ;
2018-07-18 14:45:39 +08:00
int btrfs_qgroup_account_extent ( struct btrfs_trans_handle * trans , u64 bytenr ,
u64 num_bytes , struct ulist * old_roots ,
struct ulist * new_roots ) ;
2018-03-15 16:00:25 +02:00
int btrfs_qgroup_account_extents ( struct btrfs_trans_handle * trans ) ;
2018-07-18 14:45:40 +08:00
int btrfs_run_qgroups ( struct btrfs_trans_handle * trans ) ;
2018-07-18 14:45:41 +08:00
int btrfs_qgroup_inherit ( struct btrfs_trans_handle * trans , u64 srcid ,
u64 objectid , struct btrfs_qgroup_inherit * inherit ) ;
2015-09-08 17:08:37 +08:00
void btrfs_qgroup_free_refroot ( struct btrfs_fs_info * fs_info ,
2017-12-12 15:34:23 +08:00
u64 ref_root , u64 num_bytes ,
enum btrfs_qgroup_rsv_type type ) ;
2014-05-13 17:30:47 -07:00
# ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
int btrfs_verify_qgroup_counts ( struct btrfs_fs_info * fs_info , u64 qgroupid ,
u64 rfer , u64 excl ) ;
# endif
2015-10-12 16:05:40 +08:00
/* New io_tree based accurate qgroup reserve API */
2017-02-27 15:10:38 +08:00
int btrfs_qgroup_reserve_data ( struct inode * inode ,
struct extent_changeset * * reserved , u64 start , u64 len ) ;
2015-10-12 16:28:06 +08:00
int btrfs_qgroup_release_data ( struct inode * inode , u64 start , u64 len ) ;
btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
[BUG]
For the following case, btrfs can underflow qgroup reserved space
at an error path:
(Page size 4K, function name without "btrfs_" prefix)
         Task A                |             Task B
----------------------------------------------------------------------
Buffered_write [0, 2K)         |
|- check_data_free_space()     |
|  |- qgroup_reserve_data()    |
|     Range aligned to page    |
|     range [0, 4K)     <<<    |
|     4K bytes reserved <<<    |
|- copy pages to page cache    |
                               | Buffered_write [2K, 4K)
                               | |- check_data_free_space()
                               | |  |- qgroup_reserve_data()
                               | |     Range aligned to page
                               | |     range [0, 4K)
                               | |     Already reserved by A <<<
                               | |     0 bytes reserved      <<<
                               | |- delalloc_reserve_metadata()
                               | |  And it *FAILED* (Maybe EQUOTA)
                               | |- free_reserved_data_space()
                               |    |- qgroup_free_data()
                               |       Range aligned to page range
                               |       [0, 4K)
                               |       Freeing 4K
(Special thanks to Chandan for the detailed report and analysis)
[CAUSE]
Above, Task B frees the reserved data range [0, 4K), which was actually
reserved by Task A.
At writeback time, the page dirtied by Task A goes through the writeback
routine, which frees the 4K of reserved data space again at file extent
insert time, causing the qgroup underflow.
[FIX]
For btrfs_qgroup_free_data(), add a @reserved parameter so that it only
frees data ranges reserved by a previous btrfs_qgroup_reserve_data() call.
So in the above case, Task B will try to free 0 bytes, so there is no underflow.
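The idea behind the fix can be sketched with a toy model. This is not the kernel implementation: the per-inode io_tree is replaced by a 64-bit page bitmap, and every name below (toy_inode, toy_changeset, toy_reserve_data, toy_free_data) is hypothetical and exists only for illustration. The changeset records just the pages a given caller newly reserved, and freeing is limited to those pages.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_MAX_PAGES 64u

/* Hypothetical stand-in for the per-inode reservation state. */
struct toy_inode {
	uint64_t reserved_pages;  /* one bit per page that is reserved */
	uint64_t qgroup_reserved; /* bytes currently charged to the qgroup */
};

/* Hypothetical stand-in for struct extent_changeset. */
struct toy_changeset {
	uint64_t pages;           /* pages newly reserved by this caller */
	uint64_t bytes_changed;   /* bytes newly reserved by this caller */
};

/* Bitmask of the pages touched by the byte range [start, start + len). */
static uint64_t toy_range_mask(uint64_t start, uint64_t len)
{
	uint64_t first = start / TOY_PAGE_SIZE;
	uint64_t last = (start + len - 1) / TOY_PAGE_SIZE;
	uint64_t mask = 0;

	assert(len > 0 && last < TOY_MAX_PAGES);
	for (uint64_t i = first; i <= last; i++)
		mask |= UINT64_C(1) << i;
	return mask;
}

/* Number of bytes covered by the pages set in @mask. */
static uint64_t toy_mask_bytes(uint64_t mask)
{
	uint64_t pages = 0;

	for (; mask; mask &= mask - 1)
		pages++;
	return pages * TOY_PAGE_SIZE;
}

/* Reserve the range; @cs only records pages that were not reserved before. */
static void toy_reserve_data(struct toy_inode *inode, struct toy_changeset *cs,
			     uint64_t start, uint64_t len)
{
	uint64_t newly = toy_range_mask(start, len) & ~inode->reserved_pages;

	inode->reserved_pages |= newly;
	inode->qgroup_reserved += toy_mask_bytes(newly);
	cs->pages |= newly;
	cs->bytes_changed += toy_mask_bytes(newly);
}

/* Free only the pages in the range that @cs says this caller reserved. */
static uint64_t toy_free_data(struct toy_inode *inode,
			      const struct toy_changeset *cs,
			      uint64_t start, uint64_t len)
{
	uint64_t mine = toy_range_mask(start, len) & cs->pages &
			inode->reserved_pages;
	uint64_t bytes = toy_mask_bytes(mine);

	inode->reserved_pages &= ~mine;
	inode->qgroup_reserved -= bytes;
	return bytes;
}
```

In this model, Task B's changeset stays empty because page [0, 4K) was already reserved by Task A, so Task B's error path frees nothing and Task A's reservation survives until Task A releases it.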
Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-27 15:10:39 +08:00
int btrfs_qgroup_free_data ( struct inode * inode ,
struct extent_changeset * reserved , u64 start , u64 len ) ;
2015-09-08 17:08:38 +08:00
btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans
Btrfs uses 2 different methods to reserve metadata qgroup space.
1) Reserve at btrfs_start_transaction() time
This is quite straightforward: the caller will use the allocated trans
handle to modify b-trees.
In this case, reserved metadata should be kept until qgroup numbers
are updated.
2) Reserve by using block_rsv first, and later btrfs_join_transaction()
This is more complicated: the caller will reserve space using block_rsv
first, and then later call btrfs_join_transaction() to get a trans
handle.
In this case, before we modify trees, the reserved space can be
modified on demand, and after btrfs_join_transaction(), such reserved
space should also be kept until qgroup numbers are updated.
Since these two types behave differently, split the original "META"
reservation type into 2 sub-types:
META_PERTRANS:
For above case 1)
META_PREALLOC:
For reservations that happened before btrfs_join_transaction() of
case 2)
NOTE: This patch only converts existing qgroup meta reservation
callers according to their situation; it does not ensure that all callers
reserve at the correct time.
Such fixes will be added in later patches.
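A minimal sketch of how the two sub-types might be tracked is shown below. It is a toy model, not the kernel code: toy_root, toy_rsv_type and the toy_* helpers are hypothetical names, and clamping a free or convert request to the amount actually reserved is an assumption made here to keep the counters from underflowing. The convert step mirrors what the btrfs_qgroup_convert_reserved_meta() declaration in this header describes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the two metadata reservation sub-types. */
enum toy_rsv_type {
	TOY_RSV_META_PERTRANS, /* kept until qgroup numbers are updated */
	TOY_RSV_META_PREALLOC, /* reserved before joining a transaction */
	TOY_RSV_NR,
};

/* Hypothetical per-root reservation counters, in bytes. */
struct toy_root {
	uint64_t rsv[TOY_RSV_NR];
};

/* Charge @num_bytes to the counter selected by @type. */
static void toy_reserve_meta(struct toy_root *root, uint64_t num_bytes,
			     enum toy_rsv_type type)
{
	assert(type < TOY_RSV_NR);
	root->rsv[type] += num_bytes;
}

/* Free up to @num_bytes of @type; returns the number of bytes freed. */
static uint64_t toy_free_meta(struct toy_root *root, uint64_t num_bytes,
			      enum toy_rsv_type type)
{
	uint64_t freed;

	assert(type < TOY_RSV_NR);
	/* Assumption: clamp so the counter can never underflow. */
	freed = num_bytes < root->rsv[type] ? num_bytes : root->rsv[type];
	root->rsv[type] -= freed;
	return freed;
}

/* Move preallocated bytes to the per-transaction counter once they are used. */
static void toy_convert_reserved_meta(struct toy_root *root, uint64_t num_bytes)
{
	uint64_t moved = toy_free_meta(root, num_bytes, TOY_RSV_META_PREALLOC);

	root->rsv[TOY_RSV_META_PERTRANS] += moved;
}

/* At transaction commit, all per-transaction reservation is dropped. */
static void toy_free_meta_all_pertrans(struct toy_root *root)
{
	root->rsv[TOY_RSV_META_PERTRANS] = 0;
}
```

Keeping separate counters is what lets preallocated space be returned on demand while per-transaction space stays charged until commit.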
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-12-12 15:34:29 +08:00
int __btrfs_qgroup_reserve_meta ( struct btrfs_root * root , int num_bytes ,
enum btrfs_qgroup_rsv_type type , bool enforce ) ;
/* Reserve metadata space for pertrans and prealloc type */
static inline int btrfs_qgroup_reserve_meta_pertrans ( struct btrfs_root * root ,
int num_bytes , bool enforce )
{
return __btrfs_qgroup_reserve_meta ( root , num_bytes ,
BTRFS_QGROUP_RSV_META_PERTRANS , enforce ) ;
}
static inline int btrfs_qgroup_reserve_meta_prealloc ( struct btrfs_root * root ,
int num_bytes , bool enforce )
{
return __btrfs_qgroup_reserve_meta ( root , num_bytes ,
BTRFS_QGROUP_RSV_META_PREALLOC , enforce ) ;
}
void __btrfs_qgroup_free_meta ( struct btrfs_root * root , int num_bytes ,
enum btrfs_qgroup_rsv_type type ) ;
/* Free per-transaction meta reservation for error handling */
static inline void btrfs_qgroup_free_meta_pertrans ( struct btrfs_root * root ,
int num_bytes )
{
__btrfs_qgroup_free_meta ( root , num_bytes ,
BTRFS_QGROUP_RSV_META_PERTRANS ) ;
}
/* Pre-allocated meta reservation can be freed as needed */
static inline void btrfs_qgroup_free_meta_prealloc ( struct btrfs_root * root ,
int num_bytes )
{
__btrfs_qgroup_free_meta ( root , num_bytes ,
BTRFS_QGROUP_RSV_META_PREALLOC ) ;
}
/*
* Per - transaction meta reservation should be all freed at transaction commit
* time
*/
void btrfs_qgroup_free_meta_all_pertrans ( struct btrfs_root * root ) ;
2017-12-12 15:34:31 +08:00
/*
* Convert @ num_bytes of META_PREALLOC reservation to META_PERTRANS .
*
* This is called when preallocated meta reservation needs to be used .
* Normally after btrfs_join_transaction ( ) call .
*/
void btrfs_qgroup_convert_reserved_meta ( struct btrfs_root * root , int num_bytes ) ;
2015-10-13 09:53:10 +08:00
void btrfs_qgroup_check_reserved_leak ( struct inode * inode ) ;
2018-04-03 19:16:55 +02:00
btrfs: qgroup: Introduce per-root swapped blocks infrastructure
To allow delayed subtree swap rescan, btrfs needs to record per-root
information about which tree blocks get swapped. This patch introduces
the required infrastructure.
The designed workflow will be:
1) Record the subtree root block that gets swapped.
During subtree swap:
O = Old tree blocks
N = New tree blocks
   reloc tree                     subvolume tree X
      Root                             Root
     /    \                           /    \
   NA      OB                       OA      OB
  /  |     |  \                    /  |     |  \
NC   ND    OE  OF                OC   OD    OE  OF
In this case, NA and OA are going to be swapped, record (NA, OA) into
subvolume tree X.
2) After subtree swap.
   reloc tree                     subvolume tree X
      Root                             Root
     /    \                           /    \
   OA      OB                       NA      OB
  /  |     |  \                    /  |     |  \
OC   OD    OE  OF                NC   ND    OE  OF
3a) COW happens for OB
If we are going to COW tree block OB, we check OB's bytenr against
tree X's swapped_blocks structure.
If it doesn't match any record, nothing will happen.
3b) COW happens for NA
Check NA's bytenr against tree X's swapped_blocks, and get a hit.
Then we do subtree scan on both subtrees OA and NA.
Resulting in 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
Then no matter what we do to subvolume tree X, qgroup numbers will
still be correct.
Then NA's record gets removed from X's swapped_blocks.
4) Transaction commit
Any record in X's swapped_blocks gets removed. Since there is no
modification to the swapped subtrees, there is no need to trigger a heavy
qgroup subtree rescan for them.
This will introduce 128 bytes of overhead for each btrfs_root even if qgroup
is not enabled. This is to reduce memory allocations and potential
failures.
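The workflow above can be sketched as a toy model. This is not the kernel code: a small fixed array stands in for the per-root structure, and all names (toy_swapped_block, toy_add_swapped_block, toy_cow_lookup, toy_clean_swapped_blocks) are hypothetical. A record is keyed by the bytenr of the block that ends up in the subvolume tree (NA), and a hit at COW time hands back the paired reloc tree block (OA) so both subtrees can be scanned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_SWAPPED 16

/* One recorded swap: (NA, OA) in the notation used above. */
struct toy_swapped_block {
	uint64_t subvol_bytenr; /* block now in the subvolume tree (NA) */
	uint64_t reloc_bytenr;  /* block now in the reloc tree (OA) */
	bool in_use;
};

/* Hypothetical per-root container; a fixed array keeps the sketch short. */
struct toy_swapped_blocks {
	struct toy_swapped_block slots[TOY_MAX_SWAPPED];
};

/* Step 1: record a swapped pair. Returns 0 on success, -1 if full. */
static int toy_add_swapped_block(struct toy_swapped_blocks *sb,
				 uint64_t subvol_bytenr, uint64_t reloc_bytenr)
{
	assert(sb != NULL);
	for (size_t i = 0; i < TOY_MAX_SWAPPED; i++) {
		if (!sb->slots[i].in_use) {
			sb->slots[i].subvol_bytenr = subvol_bytenr;
			sb->slots[i].reloc_bytenr = reloc_bytenr;
			sb->slots[i].in_use = true;
			return 0;
		}
	}
	return -1;
}

/*
 * Step 3: called when a block is about to be COWed. On a hit, report the
 * paired reloc block so the caller can scan both subtrees, then drop the
 * record. On a miss, nothing happens.
 */
static bool toy_cow_lookup(struct toy_swapped_blocks *sb, uint64_t bytenr,
			   uint64_t *reloc_bytenr)
{
	assert(sb != NULL && reloc_bytenr != NULL);
	for (size_t i = 0; i < TOY_MAX_SWAPPED; i++) {
		struct toy_swapped_block *slot = &sb->slots[i];

		if (slot->in_use && slot->subvol_bytenr == bytenr) {
			*reloc_bytenr = slot->reloc_bytenr;
			slot->in_use = false;
			return true;
		}
	}
	return false;
}

/* Step 4: at transaction commit, drop every remaining record. */
static void toy_clean_swapped_blocks(struct toy_swapped_blocks *sb)
{
	assert(sb != NULL);
	for (size_t i = 0; i < TOY_MAX_SWAPPED; i++)
		sb->slots[i].in_use = false;
}

/* Number of records still pending. */
static size_t toy_count_swapped_blocks(const struct toy_swapped_blocks *sb)
{
	size_t count = 0;

	assert(sb != NULL);
	for (size_t i = 0; i < TOY_MAX_SWAPPED; i++)
		count += sb->slots[i].in_use ? 1 : 0;
	return count;
}
```

A COW of OB misses and costs nothing, a COW of NA hits exactly once, and any record left at commit is discarded without a scan.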
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-23 15:15:16 +08:00
/* btrfs_qgroup_swapped_blocks related functions */
void btrfs_qgroup_init_swapped_blocks (
struct btrfs_qgroup_swapped_blocks * swapped_blocks ) ;
void btrfs_qgroup_clean_swapped_blocks ( struct btrfs_root * root ) ;
int btrfs_qgroup_add_swapped_blocks ( struct btrfs_trans_handle * trans ,
struct btrfs_root * subvol_root ,
2019-10-29 19:20:18 +01:00
struct btrfs_block_group * bg ,
2019-01-23 15:15:16 +08:00
struct extent_buffer * subvol_parent , int subvol_slot ,
struct extent_buffer * reloc_parent , int reloc_slot ,
u64 last_snapshot ) ;
btrfs: qgroup: Use delayed subtree rescan for balance
Before this patch, qgroup code traces the whole subtree of subvolume and
reloc trees unconditionally.
This makes qgroup numbers consistent, but it could cause tons of
unnecessary extent tracing, which causes a lot of overhead.
However, for the subtree swap of balance, simply swapping both subtrees is
enough: they contain the same contents and tree structure, so qgroup
numbers won't change.
It's the race window between subtree swap and transaction commit that
could cause a qgroup number change.
This patch will delay the qgroup subtree scan until COW happens for the
subtree root.
So if there are no other operations on the fs, balance won't cause extra
qgroup overhead. (best case scenario)
Depending on the workload, most of the subtree scan can still be
avoided.
Only in the worst case scenario will it fall back to the old subtree swap
overhead. (scan all swapped subtrees)
[[Benchmark]]
Hardware:
VM 4G vRAM, 8 vCPUs,
disk is using 'unsafe' cache mode,
backing device is SAMSUNG 850 evo SSD.
Host has 16G ram.
Mkfs parameter:
--nodesize 4K (To bump up tree size)
Initial subvolume contents:
4G data copied from /usr and /lib.
(With enough regular small files)
Snapshots:
16 snapshots of the original subvolume.
each snapshot has 3 random files modified.
balance parameter:
-m
So the content should be pretty similar to a real world root fs layout.
And after file system population, there is no other activity, so it
should be the best case scenario.
                     | v4.20-rc1 | w/ patchset | diff
-----------------------------------------------------------------------
relocated extents    | 22615     | 22457       | -0.7%
qgroup dirty extents | 163457    | 121606      | -25.6%
time (sys)           | 22.884s   | 18.842s     | -17.6%
time (real)          | 27.724s   | 22.884s     | -17.5%
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-23 15:15:17 +08:00
int btrfs_qgroup_trace_subtree_after_cow ( struct btrfs_trans_handle * trans ,
struct btrfs_root * root , struct extent_buffer * eb ) ;
2019-01-23 15:15:16 +08:00
2018-04-03 19:16:55 +02:00
# endif