IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
We look up an arbitrary fs root here, we need to hold a ref on the root
for the duration.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We look up whatever root userspace has given us, we need to hold a ref
throughout this operation. Use 'root' only for the on fs root and not as
a temporary variable elsewhere.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can wander into a different root, so grab a ref on the root we look
up. Later on we make root = fs_info->tree_root so we need this separate
out label to make sure we do the right cleanup only in the case we're
looking up a different root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We look up an arbitrary fs root, we need to hold a ref on it while we're
doing our search.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We lookup a arbitrary fs root, we need to hold a ref on that root. If
we're using our own inodes root then grab a ref on that as well to make
the cleanup easier.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're creating the new root here, but we should hold the ref until after
we've initialized the inode for it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Looking up the inode from an arbitrary tree means we need to hold a ref
on that root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are looking up an arbitrary inode, we need to hold a ref on the root
while we're doing this.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Looking up the inode we need to search the root, make sure we hold a
reference on that root while we're doing the lookup.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're looking up a random root, we need to hold a ref on it while we're
using it.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the root is sitting in the radix tree, we should probably have a ref
for the radix tree. Grab a ref on the root when we insert it, and drop
it when it gets deleted.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add another comment to cover how the space reservation system works
generally. This covers the actual reservation flow, as well as how
flushing is handled.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
delalloc space reservation is tricky because it encompasses both data
and metadata. Make it clear what each side does, the general flow of
how space is moved throughout the lifetime of a write, and what goes
into the calculations.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a giant comment at the top of block-rsv.c describing generally
how block reserves work. It is purely about the block reserves
themselves, and nothing to do with how the actual reservation system
works.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We want to use this for dropping all roots, and in some error cases we
may not have a root, so handle this to make the cleanup code easier.
Make btrfs_grab_fs_root the same so we can use it in cases where the
root may not exist (like the quota root).
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the orphan cleanup stuff doesn't use this directly we can just
make them static.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All this does is call btrfs_get_fs_root() with check_ref == true. Just
use btrfs_get_fs_root() so we don't have a bunch of different helpers
that do the same thing.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All helpers should either be using btrfs_get_fs_root() or
btrfs_read_tree_root().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Relocation has it's special roots, we don't want to save these in the
root cache either, so swap it to use btrfs_read_tree_root(). However
the reloc root does need REF_COWS set, so make sure we set it everywhere
we use this helper, as it no longer does the REF_COWS setting.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Tree-log uses btrfs_read_fs_root to load its log, but this just calls
btrfs_read_tree_root. We don't save the log roots in our root cache, so
just export this helper and use it in the logging code.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_find_orphan_roots has this weird thing where it looks up the root
in cache to see if it is there before just reading the root. But the
read it uses just reads the root, it doesn't do any of the init work, we
do that by hand here. But this is unnecessary, all we really want is to
see if the root still exists and add it to the dead roots list to be
cleaned up, otherwise we delete the orphan item.
Fix this by just using btrfs_get_fs_root directly with check_ref set to
false so we get the orphan root items. Then we just handle in cache and
out of cache roots the same, add them to the dead roots list and carry
on.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a helper for reading fs roots that just reads the fs root off
the disk and then sets REF_COWS and init's the inheritable flags. Move
this into btrfs_init_fs_root so we can later get rid of this helper and
consolidate all of the fs root reading into one helper.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no reason to not init the root at alloc time, and with later
patches it actually causes problems if we error out mounting the fs
before the tree_root is init'ed because we expect it to have a valid ref
count. Fix this by pushing __setup_root into btrfs_alloc_root.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have a safe way to update the isize, remove all of this code
as it's no longer needed.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have a safe way to update the i_size, replace all uses of
btrfs_ordered_update_i_size with btrfs_inode_safe_disk_i_size_write.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We want to use this everywhere we modify the file extent items
permanently. These include:
1) Inserting new file extents for writes and prealloc extents.
2) Truncating inode items.
3) btrfs_cont_expand().
4) Insert inline extents.
5) Insert new extents from log replay.
6) Insert a new extent for clone, as it could be past i_size.
7) Hole punching
For hole punching in particular it might seem it's not necessary because
anybody extending would use btrfs_cont_expand, however there is a corner
that still can give us trouble. Start with an empty file and
fallocate KEEP_SIZE 1M-2M
We now have a 0 length file, and a hole file extent from 0-1M, and a
prealloc extent from 1M-2M. Now
punch 1M-1.5M
Because this is past i_size we have
[HOLE EXTENT][ NOTHING ][PREALLOC]
[0 1M][1M 1.5M][1.5M 2M]
with an i_size of 0. Now if we pwrite 0-1.5M we'll increas our i_size
to 1.5M, but our disk_i_size is still 0 until the ordered extent
completes.
However if we now immediately truncate 2M on the file we'll just call
btrfs_cont_expand(inode, 1.5M, 2M), since our old i_size is 1.5M. If we
commit the transaction here and crash we'll expose the gap.
To fix this we need to clear the file extent mapping for the range that
we punched but didn't insert a corresponding file extent for. This will
mean the truncate will only get an disk_i_size set to 1M if we crash
before the finish ordered io happens.
I've written an xfstest to reproduce the problem and validate this fix.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to keep track of where we have file extents on disk, and thus
where it is safe to adjust the i_size to, we need to have a tree in
place to keep track of the contiguous areas we have file extents for.
Add helpers to use this tree, as it's not required for NO_HOLES file
systems. We will use this by setting DIRTY for areas we know we have
file extent item's set, and clearing it when we remove file extent items
for truncation.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were using btrfs_i_size_write(), which unconditionally jacks up
inode->disk_i_size. However since clone can operate on ranges we could
have pending ordered extents for a range prior to the start of our clone
operation and thus increase disk_i_size too far and have a hole with no
file extent.
Fix this by using the btrfs_ordered_update_i_size helper which will do
the right thing in the face of pending ordered extents outside of our
clone range.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfsctl was removed in 2012, now the function btrfs_control_ioctl()
is only used for devices ioctls. So update the comment.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Su Yue <Damenly_Su@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Relocation is one of the most complex part of btrfs, while it's also the
foundation stone for online resizing, profile converting.
For such a complex facility, we should at least have some introduction
to it.
This patch will add an basic introduction at pretty a high level,
explaining:
- What relocation does
- How relocation is done
Only mentioning how data reloc tree and reloc tree are involved in the
operation.
No details like the backref cache, or the data reloc tree contents.
- Which function to refer.
More detailed comments will be added for reloc tree creation, data reloc
tree creation and backref cache.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Each new element added to the mod seq list is always appended to the list,
and each one gets a sequence number coming from a counter which gets
incremented everytime a new element is added to the list (or a new node
is added to the tree mod log rbtree). Therefore the element with the
lowest sequence number is always the first element in the list.
So just remove the list iteration at btrfs_put_tree_mod_seq() that
computes the minimum sequence number in the list and replace it with
a check for the first element's sequence number.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The overview of btrfs dev-replace. It mentions some corner cases caused
by the write duplication and scrub based data copy.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ adjust wording ]
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl53Vh0ACgkQxWXV+ddt
WDtfOQ//bbUyKXcdH0FBZOCEcJmegcK1eUFYqKrwR2bHGe5JRdLM8pAvjCcqmWeO
jtaRiFC4NSCqTIl3mkBUb+XmQtjZwixBUHRxJpuEO8zqawvFZXTqg/KJklNvi2rd
KdflSNia6KrozTT+B/lpwZ5emS+wSdj5XTZ6VGj4riwtphSfWAjOu+4cOASMeFu+
Gfn+N9xu0ZcR/6zO20xAg0Xz+WU2uj4EfeM35dtRP2bPLG0yOGmiYT15Ll9h74Wm
7F+28iNTQfYutAexGvUpiouanGXE+ka3TCsJg5LuVTpdKGraOVGEuX+RhsyoKQrB
E8bk91fbkLlooluhUC306iNA9/+RN/yFGtILX8JsgI2Od26ZuU01l/OHrc19MDIm
gw1w3PMsD/hXLsG5ba4QsIYOzXofSrPdWej29h/o5p0VEQrAoCJEpAi7fVsiJDR1
sx6kCodw5jYhVs1P6DdXO1pgjE7iFUmjUQCFkl40edPMLy/LwB99A4zNnCOwI0KZ
49CMWHDe+tXVJBTzPvtma/PycQHIxJYMf1f8ko9E4stB7HtfH4dnUERDkb1UwQ5n
aJgyhsCCnp/EJoPunUT7g9nLUdyu0Rtwknn3NascWZEieX2QhKEF5RcjAUSL+Hlo
jbGGvoLhG0nOtYkU7BNSQbL8wxPJEEAq8e6F4tWMcOkhX4pNZP8=
=YkB0
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two fixes.
The first is a regression: when dropping some incompat bits the
conditions were reversed. The other is a fix for rename whiteout
potentially leaving stack memory linked to a list"
* tag 'for-5.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix removal of raid[56|1c34} incompat flags after removing block group
btrfs: fix log context list corruption after rename whiteout error
We are incorrectly dropping the raid56 and raid1c34 incompat flags when
there are still raid56 and raid1c34 block groups, not when we do not any
of those anymore. The logic just got unintentionally broken after adding
the support for the raid1c34 modes.
Fix this by clear the flags only if we do not have block groups with the
respective profiles.
Fixes: 9c907446dce3 ("btrfs: drop incompat bit for raid1c34 after last block group is gone")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During a rename whiteout, if btrfs_whiteout_for_rename() returns an error
we can end up returning from btrfs_rename() with the log context object
still in the root's log context list - this happens if 'sync_log' was
set to true before we called btrfs_whiteout_for_rename() and it is
dangerous because we end up with a corrupt linked list (root->log_ctxs)
as the log context object was allocated on the stack.
After btrfs_rename() returns, any task that is running btrfs_sync_log()
concurrently can end up crashing because that linked list is traversed by
btrfs_sync_log() (through btrfs_remove_all_log_ctxs()). That results in
the same issue that commit e6c617102c7e4 ("Btrfs: fix log context list
corruption after rename exchange operation") fixed.
Fixes: d4682ba03ef618 ("Btrfs: sync log after logging new name")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl5iXe8ACgkQxWXV+ddt
WDvWGg/+LFP+Y8Qz6xHTl3vXuGJKjCr7X/MIi69r2N0JFoCUeXyOdxeSlOuNCfhb
HiLZzfA5TYoptsdLJAXQLy7nPKFCQcc+J19Mbt2+aebpdGqfgN+YZEGkltfKL8Ao
xjOGu5HROFFpNTtnwa1dYOQkyVuZ8oafuJxwVJ8T28fxepRvBbi5jmy3lb1ypL3W
NoIPBe+67g5z/W0ATFmBMF7cCbvS5gsEGWKpbbjh7r8ZHJkhUaxVU7YdxPqlXrAO
ejZfiJUwi8rTGm0zd8A5TX/wsxSeBEXolvh91k5tatTljjzROHa028KRg2voUZIW
C5/7X+Z2C3gzuT0o7TGLBOR6CkVhkSutDV8/QIE6hDjZ/aCMNi0mIFco1hG8jjd1
jQfjemjj7PWuVEnZ6EuVSoHSXjZvBvX66of40YhTQEtSaJpcZU4jP26+8cXENN6+
6WbWcQpEQbT0cp0YKWhWvAIwGMf0jmWESISeFMEaF0eQd8BtzrH1qYcs3JTmXvHC
XmC47hoEJLhjQkAgQ4oNa5PZQzR1wEfW/4FPdqlADOR2frE1wDiKdrpN/dkAYHdQ
edNlo9u0+bRWCP40p04i2IUX/aUAc+me9QxiZwxT3Fw0g5QBSE2035Ly4spvT8NZ
gIvwnq1KGxmtrJSo5Lpkv4bjHYbByYMOiGJUMOTCIEdqajFI224=
=06pr
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One fixup for DIO when in use with the new checksums, a missed case
where the checksum size was still assuming u32"
* tag 'for-5.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix RAID direct I/O reads with alternate csums
btrfs_lookup_and_bind_dio_csum() does pointer arithmetic which assumes
32-bit checksums. If using a larger checksum, this leads to spurious
failures when a direct I/O read crosses a stripe. This is easy
to reproduce:
# mkfs.btrfs -f --checksum blake2 -d raid0 /dev/vdc /dev/vdd
...
# mount /dev/vdc /mnt
# cd /mnt
# dd if=/dev/urandom of=foo bs=1M count=1 status=none
# dd if=foo of=/dev/null bs=1M iflag=direct status=none
dd: error reading 'foo': Input/output error
# dmesg | tail -1
[ 135.821568] BTRFS warning (device vdc): csum failed root 5 ino 257 off 421888 ...
Fix it by using the actual checksum size.
Fixes: 1e25a2e3ca0d ("btrfs: don't assume ordered sums to be 4 bytes")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl5SdTAACgkQxWXV+ddt
WDtOPQ/+PGbN2bfZSf0DBUcOC+IOdiCYWHmaO6NQZzQb4CTjuumsDErSTzbg77OQ
gBqRBfcaWnyZZIabHe0JjctQ//buycP8yGb8VevgVdD4qBnb5sM/pvBqT4kO9Omy
6K3W7RhcP8gejOxGJb4nuiUBdYRW3KTji8QT3TMv/njuqn0p8n1NgzVlz0m6+xtO
IfJOqThvvG9LBytsQq/pvqaoXID/06lXRyM42XbsKFyygv10vp69xz5Skdl1XkGk
pyymzLlLDkLorRuDkjnzIvOqNFDAYiSFJafHupSE4SlfscRYfTwCV0uXE+NX4bdL
piTr5169ALoIRyTHU37YNNCPZXNGHnmn5Mtf4o8Q0ps0MKz1+sIjFRAxrdolLHjw
iYcocXU9L5wtHUBouTBzAsnuWJnizJNExHYb/3MHnHZYDu371l8wOGL1AHzRXEm/
qkxTPS3V2JpaFkFvTmwEQbl8PzDgpxpPDDBQUcEBZn3Vb9AFX38Fo/5OiGOnpnTd
9wVKXK2S6vHz/xpfcT/3SOtDUljSPJMUXUVkVcdz+OfUWvn6icXzR0t4keXVqZv+
INSOHzXb+iCuIb+NX7VZTN9oq0GH58aA+ApIo5beNbwJ6EzFotGE8dQmK3AfQ6pU
Pod0gzW5Hqj9Q2AfWYTZjWP+Og+dF2+bQFYSRK8NjcKjcrvUEVU=
=Asj0
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"These are fixes that were found during testing with help of error
injection, plus some other stable material.
There's a fixup to patch added to rc1 causing locking in wrong context
warnings, tests found one more deadlock scenario. The patches are
tagged for stable, two of them now in the queue but we'd like all
three released at the same time.
I'm not happy about fixes to fixes in such a fast succession during
rcs, but I hope we found all the fallouts of commit 28553fa992cb
('Btrfs: fix race between shrinking truncate and fiemap')"
* tag 'for-5.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix deadlock during fast fsync when logging prealloc extents beyond eof
Btrfs: fix btrfs_wait_ordered_range() so that it waits for all ordered extents
btrfs: fix bytes_may_use underflow in prealloc error condtition
btrfs: handle logged extent failure properly
btrfs: do not check delayed items are empty for single transaction cleanup
btrfs: reset fs_root to NULL on error in open_ctree
btrfs: destroy qgroup extent records on transaction abort
While logging the prealloc extents of an inode during a fast fsync we call
btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while
holding a read lock on a leaf of the inode's root (not the log root, the
fs/subvol root), and then that function locks the file range in the inode's
iotree. This can lead to a deadlock when:
* the fsync is ranged
* the file has prealloc extents beyond eof
* writeback for a range different from the fsync range starts
during the fsync
* the size of the file is not sector size aligned
Because when finishing an ordered extent we lock first a file range and
then try to COW the fs/subvol tree to insert an extent item.
The following diagram shows how the deadlock can happen.
CPU 1 CPU 2
btrfs_sync_file()
--> for range [0, 1MiB)
--> inode has a size of
1MiB and has 1 prealloc
extent beyond the
i_size, starting at offset
4MiB
flushes all delalloc for the
range [0MiB, 1MiB) and waits
for the respective ordered
extents to complete
--> before task at CPU 1 locks the
inode, a write into file range
[1MiB, 2MiB + 1KiB) is made
--> i_size is updated to 2MiB + 1KiB
--> writeback is started for that
range, [1MiB, 2MiB + 4KiB)
--> end offset rounded up to
be sector size aligned
btrfs_log_dentry_safe()
btrfs_log_inode_parent()
btrfs_log_inode()
btrfs_log_changed_extents()
btrfs_log_prealloc_extents()
--> does a search on the
inode's root
--> holds a read lock on
leaf X
btrfs_finish_ordered_io()
--> locks range [1MiB, 2MiB + 4KiB)
--> end offset rounded up
to be sector size aligned
--> tries to cow leaf X, through
insert_reserved_file_extent()
--> already locked by the
task at CPU 1
btrfs_truncate_inode_items()
--> gets an i_size of
2MiB + 1KiB, which is
not sector size
aligned
--> tries to lock file
range [2MiB, (u64)-1)
--> the start range
is rounded down
from 2MiB + 1K
to 2MiB to be sector
size aligned
--> but the subrange
[2MiB, 2MiB + 4KiB) is
already locked by
task at CPU 2 which
is waiting to get a
write lock on leaf X
for which we are
holding a read lock
*** deadlock ***
This results in a stack trace like the following, triggered by test case
generic/561 from fstests:
[ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds.
[ 2779.979536] Not tainted 5.6.0-rc2-btrfs-next-53 #1
[ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2779.990136] kworker/u8:6 D 0 247 2 0x80004000
[ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[ 2779.990466] Call Trace:
[ 2779.990491] ? __schedule+0x384/0xa30
[ 2779.990521] schedule+0x33/0xe0
[ 2779.990616] btrfs_tree_read_lock+0x19e/0x2e0 [btrfs]
[ 2779.990632] ? remove_wait_queue+0x60/0x60
[ 2779.990730] btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
[ 2779.990782] btrfs_search_slot+0x510/0x1000 [btrfs]
[ 2779.990869] btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
[ 2779.990944] __btrfs_drop_extents+0x161/0x1060 [btrfs]
[ 2779.990987] ? mark_held_locks+0x6d/0xc0
[ 2779.990994] ? __slab_alloc.isra.49+0x99/0x100
[ 2779.991060] ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs]
[ 2779.991145] insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs]
[ 2779.991222] ? start_transaction+0xdd/0x5c0 [btrfs]
[ 2779.991291] btrfs_finish_ordered_io+0x4f4/0x840 [btrfs]
[ 2779.991405] btrfs_work_helper+0xaa/0x720 [btrfs]
[ 2779.991432] process_one_work+0x26d/0x6a0
[ 2779.991460] worker_thread+0x4f/0x3e0
[ 2779.991481] ? process_one_work+0x6a0/0x6a0
[ 2779.991489] kthread+0x103/0x140
[ 2779.991499] ? kthread_create_worker_on_cpu+0x70/0x70
[ 2779.991515] ret_from_fork+0x3a/0x50
(...)
[ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds.
[ 2780.027480] Not tainted 5.6.0-rc2-btrfs-next-53 #1
[ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2780.030035] fsstress D 0 17375 17373 0x00004000
[ 2780.030038] Call Trace:
[ 2780.030044] ? __schedule+0x384/0xa30
[ 2780.030052] schedule+0x33/0xe0
[ 2780.030075] lock_extent_bits+0x20c/0x320 [btrfs]
[ 2780.030094] ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs]
[ 2780.030098] ? rcu_read_lock_sched_held+0x59/0xa0
[ 2780.030102] ? remove_wait_queue+0x60/0x60
[ 2780.030122] btrfs_truncate_inode_items+0x133/0x1150 [btrfs]
[ 2780.030151] ? btrfs_set_path_blocking+0xb2/0x160 [btrfs]
[ 2780.030165] ? btrfs_search_slot+0x379/0x1000 [btrfs]
[ 2780.030195] btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs]
[ 2780.030202] ? do_raw_spin_unlock+0x49/0xc0
[ 2780.030215] ? btrfs_get_num_csums+0x10/0x10 [btrfs]
[ 2780.030239] btrfs_log_inode+0xf83/0x1124 [btrfs]
[ 2780.030251] ? __mutex_unlock_slowpath+0x45/0x2a0
[ 2780.030275] btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
[ 2780.030282] ? dget_parent+0xa1/0x370
[ 2780.030309] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[ 2780.030329] btrfs_sync_file+0x3f3/0x490 [btrfs]
[ 2780.030339] do_fsync+0x38/0x60
[ 2780.030343] __x64_sys_fdatasync+0x13/0x20
[ 2780.030345] do_syscall_64+0x5c/0x280
[ 2780.030348] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2780.030356] RIP: 0033:0x7f2d80f6d5f0
[ 2780.030361] Code: Bad RIP value.
[ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0
[ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003
[ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c
[ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4
[ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90
So fix this by making btrfs_truncate_inode_items() not lock the range in
the inode's iotree when the target root is a log root, since it's not
needed to lock the range for log roots as the protection from the inode's
lock and log_mutex are all that's needed.
Fixes: 28553fa992cb28 ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_wait_ordered_range() once we find an ordered extent that has
finished with an error we exit the loop and don't wait for any other
ordered extents that might be still in progress.
All the users of btrfs_wait_ordered_range() expect that there are no more
ordered extents in progress after that function returns. So past fixes
such like the ones from the two following commits:
ff612ba7849964 ("btrfs: fix panic during relocation after ENOSPC before
writeback happens")
28aeeac1dd3080 ("Btrfs: fix panic when starting bg cache writeout after
IO error")
don't work when there are multiple ordered extents in the range.
Fix that by making btrfs_wait_ordered_range() wait for all ordered extents
even after it finds one that had an error.
Link: https://github.com/kdave/btrfs-progs/issues/228#issuecomment-569777554
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I hit the following warning while running my error injection stress
testing:
WARNING: CPU: 3 PID: 1453 at fs/btrfs/space-info.h:108 btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
RIP: 0010:btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
Call Trace:
btrfs_free_reserved_data_space+0x4f/0x70 [btrfs]
__btrfs_prealloc_file_range+0x378/0x470 [btrfs]
elfcorehdr_read+0x40/0x40
? elfcorehdr_read+0x40/0x40
? btrfs_commit_transaction+0xca/0xa50 [btrfs]
? dput+0xb4/0x2a0
? btrfs_log_dentry_safe+0x55/0x70 [btrfs]
? btrfs_sync_file+0x30e/0x420 [btrfs]
? do_fsync+0x38/0x70
? __x64_sys_fdatasync+0x13/0x20
? do_syscall_64+0x5b/0x1b0
? entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens if we fail to insert our reserved file extent. At this
point we've already converted our reservation from ->bytes_may_use to
->bytes_reserved. However once we break we will attempt to free
everything from [cur_offset, end] from ->bytes_may_use, but our extent
reservation will overlap part of this.
Fix this problem by adding ins.offset (our extent allocation size) to
cur_offset so we remove the actual remaining part from ->bytes_may_use.
I validated this fix using my inject-error.py script
python inject-error.py -o should_fail_bio -t cache_save_setup -t \
__btrfs_prealloc_file_range \
-t insert_reserved_file_extent.constprop.0 \
-r "-5" ./run-fsstress.sh
where run-fsstress.sh simply mounts and runs fsstress on a disk.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're allocating a logged extent we attempt to insert an extent
record for the file extent directly. We increase
space_info->bytes_reserved, because the extent entry addition will call
btrfs_update_block_group(), which will convert the ->bytes_reserved to
->bytes_used. However if we fail at any point while inserting the
extent entry we will bail and leave space on ->bytes_reserved, which
will trigger a WARN_ON() on umount. Fix this by pinning the space if we
fail to insert, which is what happens in every other failure case that
involves adding the extent entry.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_assert_delayed_root_empty() will check if the delayed root is
completely empty, but this is a filesystem-wide check. On cleanup we
may have allowed other transactions to begin, for whatever reason, and
thus the delayed root is not empty.
So remove this check from cleanup_one_transation(). This however can
stay in btrfs_cleanup_transaction(), because it checks only after all of
the transactions have been properly cleaned up, and thus is valid.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While running my error injection script I hit a panic when we tried to
clean up the fs_root when freeing the fs_root. This is because
fs_info->fs_root == PTR_ERR(-EIO), which isn't great. Fix this by
setting fs_info->fs_root = NULL; if we fail to read the root.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We clean up the delayed references when we abort a transaction but we
leave the pending qgroup extent records behind, leaking memory.
This patch destroys the extent records when we destroy the delayed refs
and makes sure ensure they're gone before releasing the transaction.
Fixes: 3368d001ba5d ("btrfs: qgroup: Record possible quota-related extent for qgroup.")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
[ Rebased to latest upstream, remove to_qgroup() helper, use
rbtree_postorder_for_each_entry_safe() wrapper ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl5K0xIACgkQxWXV+ddt
WDvihhAAjTdEr8Gbt4Beu+F9I2j+1+x/qdVADsrJErBJ5s1J0prXI27Vx5auHMMD
qJViTGTTjwlIo4jFI+AtlrzcE07eIMN1zzjVb2o8efrpkFDTfgT+dAkSVR8zAX20
nycsQ93QebCrlRGpPuXsniczy+QTRXWBY+vPXe1UbfGrzb3q+1pBJgfdq/7pkt/V
0RgB78JDVWUhw+YBjLuV0Y7tjmWlANGvofT34A34CR23i/JoIjVx7GfeQyILBmHE
iuCwzMgAdIKlRpjG5WJeK60bW0Dkjmbr3HDyJexiXSa4u2UwgUlT8UsITbVCr48t
OrfYxb7aWmwEqa5fbZzIiyTCWDKSxiQ5X0XW8e6ehdkh+s9DWmNp48xt9OogpYcv
8qni+vMgjrOLZwj//DgHpr2CYrISxJ+vhCvaOLMK++8TMIH/EQZEyaYkKn5Osg/8
iSwI9b2IF8LmnnvWhkbs1p4Jn2Ca4fMhlN2D5xflwSK/lbhgJk5E6P+4h6pQ9MvG
U5bWR2mqW1xT+lCO7AD+MBq+VbF/BZik0TjuNssoYJJtx/VbTbpydrHW65SvZvQK
QT7MTHekkbtIR+SGYl66n1Sc4RG7znWNwjfoyVEbSblLIczfzDDsl1O/9ldqbia4
BoPWghEO8Wy543UTH9DyX+oMzB2Pwb/TMa4eQyNhbQMqcwXHv5Q=
=nlEW
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"This is the fix for sleeping in a locked section bug reported by Dave
Jones, caused by a patch dependence in development and pulled
branches.
I picked the existing patch over the fixup that Filipe sent, as it's a
bit more generic fix. I've verified it with a specific test case, some
rsync stress and one round of fstests"
* tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't set path->leave_spinning for truncate
The only time we actually leave the path spinning is if we're truncating
a small amount and don't actually free an extent, which is not a common
occurrence. We have to set the path blocking in order to add the
delayed ref anyway, so the first extent we find we set the path to
blocking and stay blocking for the duration of the operation. With the
upcoming file extent map stuff there will be another case that we have
to have the path blocking, so just swap to blocking always.
Note: this patch also fixes a warning after 28553fa992cb ("Btrfs: fix
race between shrinking truncate and fiemap") got merged that inserts
extent locks around truncation so the path must not leave spinning locks
after btrfs_search_slot.
[70.794783] BUG: sleeping function called from invalid context at mm/slab.h:565
[70.794834] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1141, name: rsync
[70.794863] 5 locks held by rsync/1141:
[70.794876] #0: ffff888417b9c408 (sb_writers#17){.+.+}, at: mnt_want_write+0x20/0x50
[70.795030] #1: ffff888428de28e8 (&type->i_mutex_dir_key#13/1){+.+.}, at: lock_rename+0xf1/0x100
[70.795051] #2: ffff888417b9c608 (sb_internal#2){.+.+}, at: start_transaction+0x394/0x560
[70.795124] #3: ffff888403081768 (btrfs-fs-01){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
[70.795203] #4: ffff888403086568 (btrfs-fs-00){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
[70.795222] CPU: 5 PID: 1141 Comm: rsync Not tainted 5.6.0-rc2-backup+ #2
[70.795362] Call Trace:
[70.795374] dump_stack+0x71/0xa0
[70.795445] ___might_sleep.part.96.cold.106+0xa6/0xb6
[70.795459] kmem_cache_alloc+0x1d3/0x290
[70.795471] alloc_extent_state+0x22/0x1c0
[70.795544] __clear_extent_bit+0x3ba/0x580
[70.795557] ? _raw_spin_unlock_irq+0x24/0x30
[70.795569] btrfs_truncate_inode_items+0x339/0xe50
[70.795647] btrfs_evict_inode+0x269/0x540
[70.795659] ? dput.part.38+0x29/0x460
[70.795671] evict+0xcd/0x190
[70.795682] __dentry_kill+0xd6/0x180
[70.795754] dput.part.38+0x2ad/0x460
[70.795765] do_renameat2+0x3cb/0x540
[70.795777] __x64_sys_rename+0x1c/0x20
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Fixes: 28553fa992cb ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl5JQQUACgkQxWXV+ddt
WDv+fRAAh2VeACdVgGPCqH3kBphdBzqT2x9EUWZK++AWTr0Upg4Mz1mvD2h1WX6L
Yd/cuG0hIy2AFl1fEwrOp9d64PGFRnVcs0IS0C9GC/kH9ZzrSuz3pcfni3uxCr0N
t+/Pghxb654fLPd8eho9Pnc9bi/EPPRZ+mmwup4A9MyC7Bdfzf821ykrhqwam54V
LWRUanVfeJOuxNbgF2Hjnr9ZI2fALsFfh5+q1VCeNsm76Le67RGVi7V6i9Gxj8eE
3zOH/XM1Jzft3cLinfimIUC0BQKBuFXSKnYun5zP47Y5VHgxYCcfBBAD71uGgs52
mxaz1vLGE6GZo631zBBXvM6BalEHyQROP8Daman9NiJDctkvICXzv/bwi4k5ZHSF
i45MANAtBb10S/AYitscqmGR7sYHZtKBkgJHjH+0YqKgcu3OWInqNVuW77kZEguV
/UjcdjAmxirvyAaxOdfXtBfuZKBU1kL+dZwg4OeqzbDoBsUAtkEG+3oIcjkGnP9e
6QuQx5XYKmI1tugcAz0uj5ryXW6UtIFVy+Fgxht+XMSNRR5CiUpu7Ndm/ZrKx5RS
/hv2KZot+/p7IA9J0YvZbYwwvw28YxR4PgGMOKYhqKnTVHR6eaFTDNKiaQRkJaXr
Nimk6TQrpc9Qu/rcoeyuQajxCRhwSqNAJH97MJPMDz39xolEys8=
=i1Ct
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two races fixed, memory leak fix, sysfs directory fixup and two new
log messages:
- two fixed race conditions: extent map merging and truncate vs
fiemap
- create the right sysfs directory with device information and move
the individual device dirs under it
- print messages when the tree-log is replayed at mount time or
cannot be replayed on remount"
* tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: sysfs, move device id directories to UUID/devinfo
btrfs: sysfs, add UUID/devinfo kobject
Btrfs: fix race between shrinking truncate and fiemap
btrfs: log message when rw remount is attempted with unclean tree-log
btrfs: print message when tree-log replay starts
Btrfs: fix race between using extent maps and merging them
btrfs: ref-verify: fix memory leaks
Originally it was planned to create device id directories under
UUID/devinfo, but it got under UUID/devices by mistake. We really want
it under definfo so the bare device node names are not mixed with device
ids and are easy to enumerate.
Fixes: 668e48af7a94 ("btrfs: sysfs, add devid/dev_state kobject and device attributes")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Create directory /sys/fs/btrfs/UUID/devinfo to hold devices directories
by the id (unlike /devices).
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>