IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
commit 2398091f9c2c8e0040f4f9928666787a3e8108a7 upstream.
The type of parameter generation has been u32 since the beginning,
however all callers pass a u64 generation, so unify the types to prevent
potential loss.
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d37de92b38932d40e4a251e876cc388f9aee5f42 ]
In the test_no_shared_qgroup() and test_multiple_refs() qgroup self tests,
if we fail to add the tree ref, remove the extent item or remove the
extent ref, we are returning from the test function without freeing the
"old_roots" ulist that was allocated by the previous calls to
btrfs_find_all_roots(). Fix that by calling ulist_free() before returning.
Fixes: 442244c96332 ("btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit b51111271b0352aa596c5ae8faf06939e91b3b68 upstream.
For a filesystem which has btrfs read-only property set to true, all
write operations including xattr should be denied. However, security
xattr can still be changed even if btrfs ro property is true.
This happens because xattr_permission() does not have any restrictions
on security.*, system.* and in some cases trusted.* from VFS and
the decision is left to the underlying filesystem. See comments in
xattr_permission() for more details.
This patch checks if the root is read-only before performing the set
xattr operation.
Testcase:
DEV=/dev/vdb
MNT=/mnt
mkfs.btrfs -f $DEV
mount $DEV $MNT
echo "file one" > $MNT/f1
setfattr -n "security.one" -v 2 $MNT/f1
btrfs property set /mnt ro true
setfattr -n "security.one" -v 1 $MNT/f1
umount $MNT
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7a6b75b79902e47f46328b57733f2604774fa2d9 upstream.
During log replay, when processing inode references, if we get an error
when looking up for an extended reference at __add_inode_ref(), we ignore
it and proceed, returning success (0) if no other error happens after the
lookup. This is obviously wrong because in case an extended reference
exists and it encodes some name not in the log, we need to unlink it,
otherwise the filesystem state will not match the state it had after the
last fsync.
So just make __add_inode_ref() return an error it gets from the extended
reference lookup.
Fixes: f186373fef005c ("btrfs: extended inode refs")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dc4d31684974d140250f3ee612c3f0cab13b3146 upstream.
[BUG]
If we have a btrfs image with dirty log, along with an unsupported RO
compatible flag:
log_root 30474240
...
compat_flags 0x0
compat_ro_flags 0x40000003
( FREE_SPACE_TREE |
FREE_SPACE_TREE_VALID |
unknown flag: 0x40000000 )
Then even if we can only mount it RO, we will still cause metadata
update for log replay:
BTRFS info (device dm-1): flagging fs with big metadata feature
BTRFS info (device dm-1): using free space tree
BTRFS info (device dm-1): has skinny extents
BTRFS info (device dm-1): start tree-log replay
This is definitely against RO compact flag requirement.
[CAUSE]
RO compact flag only forces us to do RO mount, but we will still do log
replay for plain RO mount.
Thus this will result us to do log replay and update metadata.
This can be very problematic for new RO compat flag, for example older
kernel can not understand v2 cache, and if we allow metadata update on
RO mount and invalidate/corrupt v2 cache.
[FIX]
Just reject the mount unless rescue=nologreplay is provided:
BTRFS error (device dm-1): cannot replay dirty log with unsupport optional features (0x40000000), try rescue=nologreplay instead
We don't want to set rescue=nologreply directly, as this would make the
end user to read the old data, and cause confusion.
Since the such case is really rare, we're mostly fine to just reject the
mount with an error message, which also includes the proper workaround.
CC: stable@vger.kernel.org #4.9+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d5321a0fa8bc49f11bea0b470800962c17d92d8f upstream.
The following error message lack the "0x" obviously:
cannot mount because of unsupported optional features (4000)
Add the prefix to make it less confusing. This can happen on older
kernels that try to mount a filesystem with newer features so it makes
sense to backport to older trees.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d0e64a981fd841cb0f28fcd6afcac55e6f1e6994 upstream.
On Linux, empty symlinks are invalid, and attempting to create one with
the system call symlink(2) results in an -ENOENT error and this is
explicitly documented in the man page.
If we rename a symlink that was created in the current transaction and its
parent directory was logged before, we actually end up logging the symlink
without logging its content, which is stored in an inline extent. That
means that after a power failure we can end up with an empty symlink,
having no content and an i_size of 0 bytes.
It can be easily reproduced like this:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/testdir
$ sync
# Create a file inside the directory and fsync the directory.
$ touch /mnt/testdir/foo
$ xfs_io -c "fsync" /mnt/testdir
# Create a symlink inside the directory and then rename the symlink.
$ ln -s /mnt/testdir/foo /mnt/testdir/bar
$ mv /mnt/testdir/bar /mnt/testdir/baz
# Now fsync again the directory, this persist the log tree.
$ xfs_io -c "fsync" /mnt/testdir
<power failure>
$ mount /dev/sdc /mnt
$ stat -c %s /mnt/testdir/baz
0
$ readlink /mnt/testdir/baz
$
Fix this by always logging symlinks in full mode (LOG_INODE_ALL), so that
their content is also logged.
A test case for fstests will follow.
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2e7be9db125a0bf940c5d65eb5c40d8700f738b5 upstream.
Currently if we get IO error while doing send then we abort without
logging information about which file caused issue. So log it to help
with debugging.
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Dāvis Mosāns <davispuh@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9f05c09d6baef789726346397438cca4ec43c3ee ]
If we're looking for leafs that point to a data extent we want to record
the extent items that point at our bytenr. At this point we have the
reference and we know for a fact that this leaf should have a reference
to our bytenr. However if there's some sort of corruption we may not
find any references to our leaf, and thus could end up with eie == NULL.
Replace this BUG_ON() with an ASSERT() and then return -EUCLEAN for the
mortals.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit fcba0120edf88328524a4878d1d6f4ad39f2ec81 ]
We search for an extent entry with .offset = -1, which shouldn't be a
thing, but corruption happens. Add an ASSERT() for the developers,
return -EUCLEAN for mortals.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 45da9c1767ac31857df572f0a909fbe88fd5a7e9 upstream.
Ordered work functions aren't guaranteed to be handled by the same thread
which executed the normal work functions. The only way execution between
normal/ordered functions is synchronized is via the WORK_DONE_BIT,
unfortunately the used bitops don't guarantee any ordering whatsoever.
This manifested as seemingly inexplicable crashes on ARM64, where
async_chunk::inode is seen as non-null in async_cow_submit which causes
submit_compressed_extents to be called and crash occurs because
async_chunk::inode suddenly became NULL. The call trace was similar to:
pc : submit_compressed_extents+0x38/0x3d0
lr : async_cow_submit+0x50/0xd0
sp : ffff800015d4bc20
<registers omitted for brevity>
Call trace:
submit_compressed_extents+0x38/0x3d0
async_cow_submit+0x50/0xd0
run_ordered_work+0xc8/0x280
btrfs_work_helper+0x98/0x250
process_one_work+0x1f0/0x4ac
worker_thread+0x188/0x504
kthread+0x110/0x114
ret_from_fork+0x10/0x18
Fix this by adding respective barrier calls which ensure that all
accesses preceding setting of WORK_DONE_BIT are strictly ordered before
setting the flag. At the same time add a read barrier after reading of
WORK_DONE_BIT in run_ordered_work which ensures all subsequent loads
would be strictly ordered after reading the bit. This in turn ensures
are all accesses before WORK_DONE_BIT are going to be strictly ordered
before any access that can occur in ordered_func.
Reported-by: Chris Murphy <lists@colorremedies.com>
Fixes: 08a9ff326418 ("btrfs: Added btrfs_workqueue_struct implemented ordered execution based on kernel workqueue")
CC: stable@vger.kernel.org # 4.4+
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2011928
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Chris Murphy <chris@colorremedies.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 10adb1152d957a4d570ad630f93a88bb961616c1 upstream.
At replay_dir_deletes(), if find_dir_range() returns an error we break out
of the main while loop and then assign a value of 0 (success) to the 'ret'
variable, resulting in completely ignoring that an error happened. Fix
that by jumping to the 'out' label when find_dir_range() returns an error
(negative value).
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4e9655763b82a91e4c341835bb504a2b1590f984 upstream.
This reverts commit f2165627319ffd33a6217275e5690b1ab5c45763.
[BUG]
It's no longer possible to create compressed inline extent after commit
f2165627319f ("btrfs: compression: don't try to compress if we don't
have enough pages").
[CAUSE]
For compression code, there are several possible reasons we have a range
that needs to be compressed while it's no more than one page.
- Compressed inline write
The data is always smaller than one sector and the test lacks the
condition to properly recognize a non-inline extent.
- Compressed subpage write
For the incoming subpage compressed write support, we require page
alignment of the delalloc range.
And for 64K page size, we can compress just one page into smaller
sectors.
For those reasons, the requirement for the data to be more than one page
is not correct, and is already causing regression for compressed inline
data writeback. The idea of skipping one page to avoid wasting CPU time
could be revisited in the future.
[FIX]
Fix it by reverting the offending commit.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/afa2742.c084f5d6.17b6b08dffc@tnonline.net
Fixes: f2165627319f ("btrfs: compression: don't try to compress if we don't have enough pages")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 240246f6b913b0c23733cfd2def1d283f8cc9bbe ]
In compression write endio sequence, the range which the compressed_bio
writes is marked as uptodate if the last bio of the compressed (sub)bios
is completed successfully. There could be previous bio which may
have failed which is recorded in cb->errors.
Set the writeback range as uptodate only if cb->errors is zero, as opposed
to checking only the last bio's status.
Backporting notes: in all versions up to 4.4 the last argument is always
replaced by "!cb->errors".
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit f2165627319ffd33a6217275e5690b1ab5c45763 upstream
The early check if we should attempt compression does not take into
account the number of input pages. It can happen that there's only one
page, eg. a tail page after some ranges of the BTRFS_MAX_UNCOMPRESSED
have been processed, or an isolated page that won't be converted to an
inline extent.
The single page would be compressed but a later check would drop it
again because the result size must be at least one block shorter than
the input. That can never work with just one page.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: David Sterba <dsterba@suse.com>
[sudip: adjust context]
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit b05fbcc36be1f8597a1febef4892053a0b2f3f60 ]
With a config having PAGE_SIZE set to 256K, BTRFS build fails
with the following message
include/linux/compiler_types.h:326:38: error: call to
'__compiletime_assert_791' declared with attribute error:
BUILD_BUG_ON failed: (BTRFS_MAX_COMPRESSED % PAGE_SIZE) != 0
BTRFS_MAX_COMPRESSED being 128K, BTRFS cannot support platforms with
256K pages at the time being.
There are two platforms that can select 256K pages:
- hexagon
- powerpc
Disable BTRFS when 256K page size is selected. Supporting this would
require changes to the subpage mode that's currently being developed.
Given that 256K is many times larger than page sizes commonly used and
for what the algorithms and structures have been tuned, it's out of
scope and disabling build is a reasonable option.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 04587ad9bef6ce9d510325b4ba9852b6129eebdb ]
If we fail to update the delayed inode we need to abort the transaction,
because we could leave an inode with the improper counts or some other
such corruption behind.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 6819703f5a365c95488b07066a8744841bf14231 upstream.
The defrag loop processes leaves in batches and starting transaction for
each. The whole defragmentation on a given root is protected by a bit
but in case the transaction fails, the bit is not cleared
In case the transaction fails the bit would prevent starting
defragmentation again, so make sure it's cleared.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e7b2ec3d3d4ebeb4cff7ae45cf430182fa6a49fb upstream.
We always return 0 even in case of an error in btrfs_mark_extent_written().
Fix it to return proper error value in case of a failure. All callers
handle it.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 011b28acf940eb61c000059dd9e2cfcbf52ed96b upstream.
This function has the following pattern
while (1) {
ret = whatever();
if (ret)
goto out;
}
ret = 0
out:
return ret;
However several places in this while loop we simply break; when there's
a problem, thus clearing the return value, and in one case we do a
return -EIO, and leak the memory for the path.
Fix this by re-arranging the loop to deal with ret == 1 coming from
btrfs_search_slot, and then simply delete the
ret = 0;
out:
bit so everybody can break if there is an error, which will allow for
proper error handling to occur.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b86652be7c83f70bf406bed18ecf55adb9bfb91b upstream.
Error injection stress would sometimes fail with checksums on disk that
did not have a corresponding extent. This occurred because the pattern
in btrfs_del_csums was
while (1) {
ret = btrfs_search_slot();
if (ret < 0)
break;
}
ret = 0;
out:
btrfs_free_path(path);
return ret;
If we got an error from btrfs_search_slot we'd clear the error because
we were breaking instead of goto out. Instead of using goto out, simply
handle the cases where we may leave a random value in ret, and get rid
of the
ret = 0;
out:
pattern and simply allow break to have the proper error reporting. With
this fix we properly abort the transaction and do not commit thinking we
successfully deleted the csum.
Reviewed-by: Qu Wenruo <wqu@suse.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 7a9213a93546e7eaef90e6e153af6b8fc7553f10 ]
A few BUG_ON()'s in replace_path are purely to keep us from making
logical mistakes, so replace them with ASSERT()'s.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 67addf29004c5be9fa0383c82a364bb59afc7f84 upstream.
When creating a subvolume we allocate an extent buffer for its root node
after starting a transaction. We setup a root item for the subvolume that
points to that extent buffer and then attempt to insert the root item into
the root tree - however if that fails, due to ENOMEM for example, we do
not free the extent buffer previously allocated and we do not abort the
transaction (as at that point we did nothing that can not be undone).
This means that we effectively do not return the metadata extent back to
the free space cache/tree and we leave a delayed reference for it which
causes a metadata extent item to be added to the extent tree, in the next
transaction commit, without having backreferences. When this happens
'btrfs check' reports the following:
$ btrfs check /dev/sdi
Opening filesystem to check...
Checking filesystem on /dev/sdi
UUID: dce2cb9d-025f-4b05-a4bf-cee0ad3785eb
[1/7] checking root items
[2/7] checking extents
ref mismatch on [30425088 16384] extent item 1, found 0
backref 30425088 root 256 not referenced back 0x564a91c23d70
incorrect global backref count on 30425088 found 1 wanted 0
backpointer mismatch on [30425088 16384]
owner ref check failed [30425088 16384]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 212992 bytes used, error(s) found
total csum bytes: 0
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 124669
file data blocks allocated: 65536
referenced 65536
So fix this by freeing the metadata extent if btrfs_insert_root() returns
an error.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dbcc7d57bffc0c8cac9dac11bec548597d59a6a5 upstream.
While resolving backreferences, as part of a logical ino ioctl call or
fiemap, we can end up hitting a BUG_ON() when replaying tree mod log
operations of a root, triggering a stack trace like the following:
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:1210!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 1 PID: 19054 Comm: crawl_335 Tainted: G W 5.11.0-2d11c0084b02-misc-next+ #89
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:__tree_mod_log_rewind+0x3b1/0x3c0
Code: 05 48 8d 74 10 (...)
RSP: 0018:ffffc90001eb70b8 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff88812344e400 RCX: ffffffffb28933b6
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88812344e42c
RBP: ffffc90001eb7108 R08: 1ffff11020b60a20 R09: ffffed1020b60a20
R10: ffff888105b050f9 R11: ffffed1020b60a1f R12: 00000000000000ee
R13: ffff8880195520c0 R14: ffff8881bc958500 R15: ffff88812344e42c
FS: 00007fd1955e8700(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007efdb7928718 CR3: 000000010103a006 CR4: 0000000000170ee0
Call Trace:
btrfs_search_old_slot+0x265/0x10d0
? lock_acquired+0xbb/0x600
? btrfs_search_slot+0x1090/0x1090
? free_extent_buffer.part.61+0xd7/0x140
? free_extent_buffer+0x13/0x20
resolve_indirect_refs+0x3e9/0xfc0
? lock_downgrade+0x3d0/0x3d0
? __kasan_check_read+0x11/0x20
? add_prelim_ref.part.11+0x150/0x150
? lock_downgrade+0x3d0/0x3d0
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x600
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? rb_insert_color+0x30/0x360
? prelim_ref_insert+0x12d/0x430
find_parent_nodes+0x5c3/0x1830
? resolve_indirect_refs+0xfc0/0xfc0
? lock_release+0xc8/0x620
? fs_reclaim_acquire+0x67/0xf0
? lock_acquire+0xc7/0x510
? lock_downgrade+0x3d0/0x3d0
? lockdep_hardirqs_on_prepare+0x160/0x210
? lock_release+0xc8/0x620
? fs_reclaim_acquire+0x67/0xf0
? lock_acquire+0xc7/0x510
? poison_range+0x38/0x40
? unpoison_range+0x14/0x40
? trace_hardirqs_on+0x55/0x120
btrfs_find_all_roots_safe+0x142/0x1e0
? find_parent_nodes+0x1830/0x1830
? btrfs_inode_flags_to_xflags+0x50/0x50
iterate_extent_inodes+0x20e/0x580
? tree_backref_for_extent+0x230/0x230
? lock_downgrade+0x3d0/0x3d0
? read_extent_buffer+0xdd/0x110
? lock_downgrade+0x3d0/0x3d0
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x600
? __kasan_check_write+0x14/0x20
? _raw_spin_unlock+0x22/0x30
? __kasan_check_write+0x14/0x20
iterate_inodes_from_logical+0x129/0x170
? iterate_inodes_from_logical+0x129/0x170
? btrfs_inode_flags_to_xflags+0x50/0x50
? iterate_extent_inodes+0x580/0x580
? __vmalloc_node+0x92/0xb0
? init_data_container+0x34/0xb0
? init_data_container+0x34/0xb0
? kvmalloc_node+0x60/0x80
btrfs_ioctl_logical_to_ino+0x158/0x230
btrfs_ioctl+0x205e/0x4040
? __might_sleep+0x71/0xe0
? btrfs_ioctl_get_supported_features+0x30/0x30
? getrusage+0x4b6/0x9c0
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x620
? __might_fault+0x64/0xd0
? lock_acquire+0xc7/0x510
? lock_downgrade+0x3d0/0x3d0
? lockdep_hardirqs_on_prepare+0x210/0x210
? lockdep_hardirqs_on_prepare+0x210/0x210
? __kasan_check_read+0x11/0x20
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? lock_downgrade+0x3d0/0x3d0
? lockdep_hardirqs_on_prepare+0x210/0x210
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x620
? __task_pid_nr_ns+0xd3/0x250
? lock_acquire+0xc7/0x510
? __fget_files+0x160/0x230
? __fget_light+0xf2/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fd1976e2427
Code: 00 00 90 48 8b 05 (...)
RSP: 002b:00007fd1955e5cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fd1955e5f40 RCX: 00007fd1976e2427
RDX: 00007fd1955e5f48 RSI: 00000000c038943b RDI: 0000000000000004
RBP: 0000000001000000 R08: 0000000000000000 R09: 00007fd1955e6120
R10: 0000557835366b00 R11: 0000000000000246 R12: 0000000000000004
R13: 00007fd1955e5f48 R14: 00007fd1955e5f40 R15: 00007fd1955e5ef8
Modules linked in:
---[ end trace ec8931a1c36e57be ]---
(gdb) l *(__tree_mod_log_rewind+0x3b1)
0xffffffff81893521 is in __tree_mod_log_rewind (fs/btrfs/ctree.c:1210).
1205 * the modification. as we're going backwards, we do the
1206 * opposite of each operation here.
1207 */
1208 switch (tm->op) {
1209 case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
1210 BUG_ON(tm->slot < n);
1211 fallthrough;
1212 case MOD_LOG_KEY_REMOVE_WHILE_MOVING:
1213 case MOD_LOG_KEY_REMOVE:
1214 btrfs_set_node_key(eb, &tm->key, tm->slot);
Here's what happens to hit that BUG_ON():
1) We have one tree mod log user (through fiemap or the logical ino ioctl),
with a sequence number of 1, so we have fs_info->tree_mod_seq == 1;
2) Another task is at ctree.c:balance_level() and we have eb X currently as
the root of the tree, and we promote its single child, eb Y, as the new
root.
Then, at ctree.c:balance_level(), we call:
tree_mod_log_insert_root(eb X, eb Y, 1);
3) At tree_mod_log_insert_root() we create tree mod log elements for each
slot of eb X, of operation type MOD_LOG_KEY_REMOVE_WHILE_FREEING each
with a ->logical pointing to ebX->start. These are placed in an array
named tm_list.
Lets assume there are N elements (N pointers in eb X);
4) Then, still at tree_mod_log_insert_root(), we create a tree mod log
element of operation type MOD_LOG_ROOT_REPLACE, ->logical set to
ebY->start, ->old_root.logical set to ebX->start, ->old_root.level set
to the level of eb X and ->generation set to the generation of eb X;
5) Then tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
tm_list as argument. After that, tree_mod_log_free_eb() calls
__tree_mod_log_insert() for each member of tm_list in reverse order,
from highest slot in eb X, slot N - 1, to slot 0 of eb X;
6) __tree_mod_log_insert() sets the sequence number of each given tree mod
log operation - it increments fs_info->tree_mod_seq and sets
fs_info->tree_mod_seq as the sequence number of the given tree mod log
operation.
This means that for the tm_list created at tree_mod_log_insert_root(),
the element corresponding to slot 0 of eb X has the highest sequence
number (1 + N), and the element corresponding to the last slot has the
lowest sequence number (2);
7) Then, after inserting tm_list's elements into the tree mod log rbtree,
the MOD_LOG_ROOT_REPLACE element is inserted, which gets the highest
sequence number, which is N + 2;
8) Back to ctree.c:balance_level(), we free eb X by calling
btrfs_free_tree_block() on it. Because eb X was created in the current
transaction, has no other references and writeback did not happen for
it, we add it back to the free space cache/tree;
9) Later some other task T allocates the metadata extent from eb X, since
it is marked as free space in the space cache/tree, and uses it as a
node for some other btree;
10) The tree mod log user task calls btrfs_search_old_slot(), which calls
get_old_root(), and finally that calls __tree_mod_log_oldest_root()
with time_seq == 1 and eb_root == eb Y;
11) First iteration of the while loop finds the tree mod log element with
sequence number N + 2, for the logical address of eb Y and of type
MOD_LOG_ROOT_REPLACE;
12) Because the operation type is MOD_LOG_ROOT_REPLACE, we don't break out
of the loop, and set root_logical to point to tm->old_root.logical
which corresponds to the logical address of eb X;
13) On the next iteration of the while loop, the call to
tree_mod_log_search_oldest() returns the smallest tree mod log element
for the logical address of eb X, which has a sequence number of 2, an
operation type of MOD_LOG_KEY_REMOVE_WHILE_FREEING and corresponds to
the old slot N - 1 of eb X (eb X had N items in it before being freed);
14) We then break out of the while loop and return the tree mod log operation
of type MOD_LOG_ROOT_REPLACE (eb Y), and not the one for slot N - 1 of
eb X, to get_old_root();
15) At get_old_root(), we process the MOD_LOG_ROOT_REPLACE operation
and set "logical" to the logical address of eb X, which was the old
root. We then call tree_mod_log_search() passing it the logical
address of eb X and time_seq == 1;
16) Then before calling tree_mod_log_search(), task T adds a key to eb X,
which results in adding a tree mod log operation of type
MOD_LOG_KEY_ADD to the tree mod log - this is done at
ctree.c:insert_ptr() - but after adding the tree mod log operation
and before updating the number of items in eb X from 0 to 1...
17) The task at get_old_root() calls tree_mod_log_search() and gets the
tree mod log operation of type MOD_LOG_KEY_ADD just added by task T.
Then it enters the following if branch:
if (old_root && tm && tm->op != MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
(...)
} (...)
Calls read_tree_block() for eb X, which gets a reference on eb X but
does not lock it - task T has it locked.
Then it clones eb X while it has nritems set to 0 in its header, before
task T sets nritems to 1 in eb X's header. From hereupon we use the
clone of eb X which no other task has access to;
18) Then we call __tree_mod_log_rewind(), passing it the MOD_LOG_KEY_ADD
mod log operation we just got from tree_mod_log_search() in the
previous step and the cloned version of eb X;
19) At __tree_mod_log_rewind(), we set the local variable "n" to the number
of items set in eb X's clone, which is 0. Then we enter the while loop,
and in its first iteration we process the MOD_LOG_KEY_ADD operation,
which just decrements "n" from 0 to (u32)-1, since "n" is declared with
a type of u32. At the end of this iteration we call rb_next() to find the
next tree mod log operation for eb X, that gives us the mod log operation
of type MOD_LOG_KEY_REMOVE_WHILE_FREEING, for slot 0, with a sequence
number of N + 1 (steps 3 to 6);
20) Then we go back to the top of the while loop and trigger the following
BUG_ON():
(...)
switch (tm->op) {
case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
BUG_ON(tm->slot < n);
fallthrough;
(...)
Because "n" has a value of (u32)-1 (4294967295) and tm->slot is 0.
Fix this by taking a read lock on the extent buffer before cloning it at
ctree.c:get_old_root(). This should be done regardless of the extent
buffer having been freed and reused, as a concurrent task might be
modifying it (while holding a write lock on it).
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/20210227155037.GN28049@hungrycats.org/
Fixes: 834328a8493079 ("Btrfs: tree mod log's old roots could still be part of the tree")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d70cef0d46729808dc53f145372c02b145c92604 upstream.
When a qstripe is required an extra page is allocated and mapped. There
were 3 problems:
1) There is no corresponding call of kunmap() for the qstripe page.
2) There is no reason to map the qstripe page more than once if the
number of bits set in rbio->dbitmap is greater than one.
3) There is no reason to map the parity page and unmap it each time
through the loop.
The page memory can continue to be reused with a single mapping on each
iteration by raid6_call.gen_syndrome() without remapping. So map the
page for the duration of the loop.
Similarly, improve the algorithm by mapping the parity page just 1 time.
Fixes: 5a6ac9eacb49 ("Btrfs, raid56: support parity scrub on raid56")
CC: stable@vger.kernel.org # 4.4.x: c17af96554a8: btrfs: raid56: simplify tracking of Q stripe presence
CC: stable@vger.kernel.org # 4.4.x
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c17af96554a8a8777cbb0fd53b8497250e548b43 upstream.
There are temporary variables tracking the index of P and Q stripes, but
none of them is really used as such, merely for determining if the Q
stripe is present. This leads to compiler warnings with
-Wunused-but-set-variable and has been reported several times.
fs/btrfs/raid56.c: In function ‘finish_rmw’:
fs/btrfs/raid56.c:1199:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable]
1199 | int p_stripe = -1;
| ^~~~~~~~
fs/btrfs/raid56.c: In function ‘finish_parity_scrub’:
fs/btrfs/raid56.c:2356:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable]
2356 | int p_stripe = -1;
| ^~~~~~~~
Replace the two variables with one that has a clear meaning and also get
rid of the warnings. The logic that verifies that there are only 2
valid cases is unchanged.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 72c9925f87c8b74f36f8e75a4cd93d964538d3ca upstream.
At btrfs_copy_root(), if the call to btrfs_inc_ref() fails we end up
returning without unlocking and releasing our reference on the extent
buffer named "cow" we previously allocated with btrfs_alloc_tree_block().
So fix that by unlocking the extent buffer and dropping our reference on
it before returning.
Fixes: be20aa9dbadc8c ("Btrfs: Add mount option to turn off data cow")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c78a10aebb275c38d0cfccae129a803fe622e305 upstream.
When recovering a relocation, if we run into a reloc root that has 0
refs we simply add it to the reloc_control->reloc_roots list, and then
clean it up later. The problem with this is __del_reloc_root() doesn't
do anything if the root isn't in the radix tree, which in this case it
won't be because we never call __add_reloc_root() on the reloc_root.
This exit condition simply isn't correct really. During normal
operation we can remove ourselves from the rb tree and then we're meant
to clean up later at merge_reloc_roots() time, and this happens
correctly. During recovery we're depending on free_reloc_roots() to
drop our references, but we're short-circuiting.
Fix this by continuing to check if we're on the list and dropping
ourselves from the reloc_control root list and dropping our reference
appropriately. Change the corresponding BUG_ON() to an ASSERT() that
does the correct thing if we aren't in the rb tree.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 867ed321f90d06aaba84e2c91de51cd3038825ef upstream.
While testing my error handling patches, I added a error injection site
at btrfs_inc_extent_ref, to validate the error handling I added was
doing the correct thing. However I hit a pretty ugly corruption while
doing this check, with the following error injection stack trace:
btrfs_inc_extent_ref
btrfs_copy_root
create_reloc_root
btrfs_init_reloc_root
btrfs_record_root_in_trans
btrfs_start_transaction
btrfs_update_inode
btrfs_update_time
touch_atime
file_accessed
btrfs_file_mmap
This is because we do not catch the error from btrfs_inc_extent_ref,
which in practice would be ENOMEM, which means we lose the extent
references for a root that has already been allocated and inserted,
which is the problem. Fix this by aborting the transaction if we fail
to do the reference modification.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 3cc64e7ebfb0d7faaba2438334c43466955a96e8 ]
Return value in __load_free_space_cache is not properly set after
(unlikely) memory allocation failures and 0 is returned instead.
This is not a problem for the caller load_free_space_cache because only
value 1 is considered as 'cache loaded' but for clarity it's better
to set the errors accordingly.
Fixes: a67509c30079 ("Btrfs: add a io_ctl struct and helpers for dealing with the space cache")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 9f7fec0ba89108b9385f1b9fb167861224912a4a upstream
Some of the self tests create a test inode, setup some extents and then do
calls to btrfs_get_extent() to test that the corresponding extent maps
exist and are correct. However btrfs_get_extent(), since the 5.2 merge
window, now errors out when it finds a regular or prealloc extent for an
inode that does not correspond to a regular file (its ->i_mode is not
S_IFREG). This causes the self tests to fail sometimes, specially when
KASAN, slub_debug and page poisoning are enabled:
$ modprobe btrfs
modprobe: ERROR: could not insert 'btrfs': Invalid argument
$ dmesg
[ 9414.691648] Btrfs loaded, crc32c=crc32c-intel, debug=on, assert=on, integrity-checker=on, ref-verify=on
[ 9414.692655] BTRFS: selftest: sectorsize: 4096 nodesize: 4096
[ 9414.692658] BTRFS: selftest: running btrfs free space cache tests
[ 9414.692918] BTRFS: selftest: running extent only tests
[ 9414.693061] BTRFS: selftest: running bitmap only tests
[ 9414.693366] BTRFS: selftest: running bitmap and extent tests
[ 9414.696455] BTRFS: selftest: running space stealing from bitmap to extent tests
[ 9414.697131] BTRFS: selftest: running extent buffer operation tests
[ 9414.697133] BTRFS: selftest: running btrfs_split_item tests
[ 9414.697564] BTRFS: selftest: running extent I/O tests
[ 9414.697583] BTRFS: selftest: running find delalloc tests
[ 9415.081125] BTRFS: selftest: running find_first_clear_extent_bit test
[ 9415.081278] BTRFS: selftest: running extent buffer bitmap tests
[ 9415.124192] BTRFS: selftest: running inode tests
[ 9415.124195] BTRFS: selftest: running btrfs_get_extent tests
[ 9415.127909] BTRFS: selftest: running hole first btrfs_get_extent test
[ 9415.128343] BTRFS critical (device (efault)): regular/prealloc extent found for non-regular inode 256
[ 9415.131428] BTRFS: selftest: fs/btrfs/tests/inode-tests.c:904 expected a real extent, got 0
This happens because the test inodes are created without ever initializing
the i_mode field of the inode, and neither VFS's new_inode() nor the btrfs
callback btrfs_alloc_inode() initialize the i_mode. Initialization of the
i_mode is done through the various callbacks used by the VFS to create
new inodes (regular files, directories, symlinks, tmpfiles, etc), which
all call btrfs_new_inode() which in turn calls inode_init_owner(), which
sets the inode's i_mode. Since the tests only uses new_inode() to create
the test inodes, the i_mode was never initialized.
This always happens on a VM I used with kasan, slub_debug and many other
debug facilities enabled. It also happened to someone who reported this
on bugzilla (on a 5.3-rc).
Fix this by setting i_mode to S_IFREG at btrfs_new_test_inode().
Fixes: 6bf9e4bd6a2778 ("btrfs: inode: Verify inode mode to avoid NULL pointer dereference")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204397
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 665d4953cde6d9e75c62a07ec8f4f8fd7d396ade upstream
In commit ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device
replace") we removed the branch of copy_nocow_pages() to avoid
corruption for compressed nodatasum extents.
However above commit only solves the problem in scrub_extent(), if
during scrub_pages() we failed to read some pages,
sctx->no_io_error_seen will be non-zero and we go to fixup function
scrub_handle_errored_block().
In scrub_handle_errored_block(), for sctx without csum (no matter if
we're doing replace or scrub) we go to scrub_fixup_nodatasum() routine,
which does the similar thing with copy_nocow_pages(), but does it
without the extra check in copy_nocow_pages() routine.
So for test cases like btrfs/100, where we emulate read errors during
replace/scrub, we could corrupt compressed extent data again.
This patch will fix it just by avoiding any "optimization" for
nodatasum, just falls back to the normal fixup routine by try read from
any good copy.
This also solves WARN_ON() or dead lock caused by lame backref iteration
in scrub_fixup_nodatasum() routine.
The deadlock or WARN_ON() won't be triggered before commit ac0b4145d662
("btrfs: scrub: Don't use inode pages for device replace") since
copy_nocow_pages() have better locking and extra check for data extent,
and it's already doing the fixup work by try to read data from any good
copy, so it won't go scrub_fixup_nodatasum() anyway.
This patch disables the faulty code and will be removed completely in a
followup patch.
Fixes: ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device replace")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[sudip: adjust context]
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commoit 6f7de19ed3d4d3526ca5eca428009f97cf969c2f upstream
Commit ff3d27a048d9 ("btrfs: qgroup: Finish rescan when hit the last leaf
of extent tree") added a new exit for rescan finish.
However after finishing quota rescan, we set
fs_info->qgroup_rescan_progress to (u64)-1 before we exit through the
original exit path.
While we missed that assignment of (u64)-1 in the new exit path.
The end result is, the quota status item doesn't have the same value.
(-1 vs the last bytenr + 1)
Although it doesn't affect quota accounting, it's still better to keep
the original behavior.
Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Fixes: ff3d27a048d9 ("btrfs: qgroup: Finish rescan when hit the last leaf of extent tree")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6bf9e4bd6a277840d3fe8c5d5d530a1fbd3db592 upstream
[BUG]
When accessing a file on a crafted image, btrfs can crash in block layer:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 136501067 P4D 136501067 PUD 124519067 PMD 0
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.0.0-rc8-default #252
RIP: 0010:end_bio_extent_readpage+0x144/0x700
Call Trace:
<IRQ>
blk_update_request+0x8f/0x350
blk_mq_end_request+0x1a/0x120
blk_done_softirq+0x99/0xc0
__do_softirq+0xc7/0x467
irq_exit+0xd1/0xe0
call_function_single_interrupt+0xf/0x20
</IRQ>
RIP: 0010:default_idle+0x1e/0x170
[CAUSE]
The crafted image has a tricky corruption, the INODE_ITEM has a
different type against its parent dir:
item 20 key (268 INODE_ITEM 0) itemoff 2808 itemsize 160
generation 13 transid 13 size 1048576 nbytes 1048576
block group 0 mode 121644 links 1 uid 0 gid 0 rdev 0
sequence 9 flags 0x0(none)
This mode number 0120000 means it's a symlink.
But the dir item think it's still a regular file:
item 8 key (264 DIR_INDEX 5) itemoff 3707 itemsize 32
location key (268 INODE_ITEM 0) type FILE
transid 13 data_len 0 name_len 2
name: f4
item 40 key (264 DIR_ITEM 51821248) itemoff 1573 itemsize 32
location key (268 INODE_ITEM 0) type FILE
transid 13 data_len 0 name_len 2
name: f4
For symlink, we don't set BTRFS_I(inode)->io_tree.ops and leave it
empty, as symlink is only designed to have inlined extent, all handled
by tree block read. Thus no need to trigger btrfs_submit_bio_hook() for
inline file extent.
However end_bio_extent_readpage() expects tree->ops populated, as it's
reading regular data extent. This causes NULL pointer dereference.
[FIX]
This patch fixes the problem in two ways:
- Verify inode mode against its dir item when looking up inode
So in btrfs_lookup_dentry() if we find inode mode mismatch with dir
item, we error out so that corrupted inode will not be accessed.
- Verify inode mode when getting extent mapping
Only regular file should have regular or preallocated extent.
If we found regular/preallocated file extent for symlink or
the rest, we error out before submitting the read bio.
With this fix that crafted image can be rejected gracefully:
BTRFS critical (device loop0): inode mode mismatch with dir: inode mode=0121644 btrfs type=7 dir type=1
Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202763
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[sudip: use original btrfs_inode_type(), btrfs_crit with root->fs_info,
ISREG with inode->i_mode and adjust context]
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 80e46cf22ba0bcb57b39c7c3b52961ab3a0fd5f2 upstream
Btrfs-progs already have a comprehensive type checker, to ensure there
is only 0 (SINGLE profile) or 1 (DUP/RAID0/1/5/6/10) bit set for chunk
profile bits.
Do the same work for kernel.
Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202765
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[sudip: manually backport, use btrfs_err with root->fs_info]
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 0607eb1d452d45c5ac4c745a9e9e0d95152ea9d0 ]
If lock_extent_buffer_for_io() fails, it returns a negative value, but its
caller btree_write_cache_pages() ignores such error. This means that a
call to flush_write_bio(), from lock_extent_buffer_for_io(), might have
failed. We should make btree_write_cache_pages() notice such error values
and stop immediatelly, making sure filemap_fdatawrite_range() returns an
error to the transaction commit path. A failure from flush_write_bio()
should also result in the endio callback end_bio_extent_buffer_writepage()
being invoked, which sets the BTRFS_FS_*_ERR bits appropriately, so that
there's no risk a transaction or log commit doesn't catch a writeback
failure.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 83bc1560e02e25c6439341352024ebe8488f4fbd upstream.
If we fail to find suitable zones for a new readahead extent, we end up
leaving a stale pointer in the global readahead extents radix tree
(fs_info->reada_tree), which can trigger the following trace later on:
[13367.696354] BUG: kernel NULL pointer dereference, address: 00000000000000b0
[13367.696802] #PF: supervisor read access in kernel mode
[13367.697249] #PF: error_code(0x0000) - not-present page
[13367.697721] PGD 0 P4D 0
[13367.698171] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[13367.698632] CPU: 6 PID: 851214 Comm: btrfs Tainted: G W 5.9.0-rc6-btrfs-next-69 #1
[13367.699100] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[13367.700069] RIP: 0010:__lock_acquire+0x20a/0x3970
[13367.700562] Code: ff 1f 0f b7 c0 48 0f (...)
[13367.701609] RSP: 0018:ffffb14448f57790 EFLAGS: 00010046
[13367.702140] RAX: 0000000000000000 RBX: 29b935140c15e8cf RCX: 0000000000000000
[13367.702698] RDX: 0000000000000002 RSI: ffffffffb3d66bd0 RDI: 0000000000000046
[13367.703240] RBP: ffff8a52ba8ac040 R08: 00000c2866ad9288 R09: 0000000000000001
[13367.703783] R10: 0000000000000001 R11: 00000000b66d9b53 R12: ffff8a52ba8ac9b0
[13367.704330] R13: 0000000000000000 R14: ffff8a532b6333e8 R15: 0000000000000000
[13367.704880] FS: 00007fe1df6b5700(0000) GS:ffff8a5376600000(0000) knlGS:0000000000000000
[13367.705438] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13367.705995] CR2: 00000000000000b0 CR3: 000000022cca8004 CR4: 00000000003706e0
[13367.706565] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[13367.707127] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[13367.707686] Call Trace:
[13367.708246] ? ___slab_alloc+0x395/0x740
[13367.708820] ? reada_add_block+0xae/0xee0 [btrfs]
[13367.709383] lock_acquire+0xb1/0x480
[13367.709955] ? reada_add_block+0xe0/0xee0 [btrfs]
[13367.710537] ? reada_add_block+0xae/0xee0 [btrfs]
[13367.711097] ? rcu_read_lock_sched_held+0x5d/0x90
[13367.711659] ? kmem_cache_alloc_trace+0x8d2/0x990
[13367.712221] ? lock_acquired+0x33b/0x470
[13367.712784] _raw_spin_lock+0x34/0x80
[13367.713356] ? reada_add_block+0xe0/0xee0 [btrfs]
[13367.713966] reada_add_block+0xe0/0xee0 [btrfs]
[13367.714529] ? btrfs_root_node+0x15/0x1f0 [btrfs]
[13367.715077] btrfs_reada_add+0x117/0x170 [btrfs]
[13367.715620] scrub_stripe+0x21e/0x10d0 [btrfs]
[13367.716141] ? kvm_sched_clock_read+0x5/0x10
[13367.716657] ? __lock_acquire+0x41e/0x3970
[13367.717184] ? scrub_chunk+0x60/0x140 [btrfs]
[13367.717697] ? find_held_lock+0x32/0x90
[13367.718254] ? scrub_chunk+0x60/0x140 [btrfs]
[13367.718773] ? lock_acquired+0x33b/0x470
[13367.719278] ? scrub_chunk+0xcd/0x140 [btrfs]
[13367.719786] scrub_chunk+0xcd/0x140 [btrfs]
[13367.720291] scrub_enumerate_chunks+0x270/0x5c0 [btrfs]
[13367.720787] ? finish_wait+0x90/0x90
[13367.721281] btrfs_scrub_dev+0x1ee/0x620 [btrfs]
[13367.721762] ? rcu_read_lock_any_held+0x8e/0xb0
[13367.722235] ? preempt_count_add+0x49/0xa0
[13367.722710] ? __sb_start_write+0x19b/0x290
[13367.723192] btrfs_ioctl+0x7f5/0x36f0 [btrfs]
[13367.723660] ? __fget_files+0x101/0x1d0
[13367.724118] ? find_held_lock+0x32/0x90
[13367.724559] ? __fget_files+0x101/0x1d0
[13367.724982] ? __x64_sys_ioctl+0x83/0xb0
[13367.725399] __x64_sys_ioctl+0x83/0xb0
[13367.725802] do_syscall_64+0x33/0x80
[13367.726188] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[13367.726574] RIP: 0033:0x7fe1df7add87
[13367.726948] Code: 00 00 00 48 8b 05 09 91 (...)
[13367.727763] RSP: 002b:00007fe1df6b4d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[13367.728179] RAX: ffffffffffffffda RBX: 000055ce1fb596a0 RCX: 00007fe1df7add87
[13367.728604] RDX: 000055ce1fb596a0 RSI: 00000000c400941b RDI: 0000000000000003
[13367.729021] RBP: 0000000000000000 R08: 00007fe1df6b5700 R09: 0000000000000000
[13367.729431] R10: 00007fe1df6b5700 R11: 0000000000000246 R12: 00007ffd922b07de
[13367.729842] R13: 00007ffd922b07df R14: 00007fe1df6b4e40 R15: 0000000000802000
[13367.730275] Modules linked in: btrfs blake2b_generic xor (...)
[13367.732638] CR2: 00000000000000b0
[13367.733166] ---[ end trace d298b6805556acd9 ]---
What happens is the following:
1) At reada_find_extent() we don't find any existing readahead extent for
the metadata extent starting at logical address X;
2) So we proceed to create a new one. We then call btrfs_map_block() to get
information about which stripes contain extent X;
3) After that we iterate over the stripes and create only one zone for the
readahead extent - only one because reada_find_zone() returned NULL for
all iterations except for one, either because a memory allocation failed
or it couldn't find the block group of the extent (it may have just been
deleted);
4) We then add the new readahead extent to the readahead extents radix
tree at fs_info->reada_tree;
5) Then we iterate over each zone of the new readahead extent, and find
that the device used for that zone no longer exists, because it was
removed or it was the source device of a device replace operation.
Since this left 'have_zone' set to 0, after finishing the loop we jump
to the 'error' label, call kfree() on the new readahead extent and
return without removing it from the radix tree at fs_info->reada_tree;
6) Any future call to reada_find_extent() for the logical address X will
find the stale pointer in the readahead extents radix tree, increment
its reference counter, which can trigger the use-after-free right
away or return it to the caller reada_add_block() that results in the
use-after-free of the example trace above.
So fix this by making sure we delete the readahead extent from the radix
tree if we fail to setup zones for it (when 'have_zone = 0').
Fixes: 319450211842ba ("btrfs: reada: bypass adding extent when all zone failed")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 572c83acdcdafeb04e70aa46be1fa539310be20c upstream.
In fstest btrfs/064 a transaction abort in __btrfs_cow_block could lead
to a system lockup. It gets stuck trying to write back inodes, and the
write back thread was trying to lock an extent buffer:
$ cat /proc/2143497/stack
[<0>] __btrfs_tree_lock+0x108/0x250
[<0>] lock_extent_buffer_for_io+0x35e/0x3a0
[<0>] btree_write_cache_pages+0x15a/0x3b0
[<0>] do_writepages+0x28/0xb0
[<0>] __writeback_single_inode+0x54/0x5c0
[<0>] writeback_sb_inodes+0x1e8/0x510
[<0>] wb_writeback+0xcc/0x440
[<0>] wb_workfn+0xd7/0x650
[<0>] process_one_work+0x236/0x560
[<0>] worker_thread+0x55/0x3c0
[<0>] kthread+0x13a/0x150
[<0>] ret_from_fork+0x1f/0x30
This is because we got an error while COWing a block, specifically here
if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state)) {
ret = btrfs_reloc_cow_block(trans, root, buf, cow);
if (ret) {
btrfs_abort_transaction(trans, ret);
return ret;
}
}
[16402.241552] BTRFS: Transaction aborted (error -2)
[16402.242362] WARNING: CPU: 1 PID: 2563188 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x376/0x540
[16402.249469] CPU: 1 PID: 2563188 Comm: fsstress Not tainted 5.9.0-rc6+ #8
[16402.249936] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
[16402.250525] RIP: 0010:__btrfs_cow_block+0x376/0x540
[16402.252417] RSP: 0018:ffff9cca40e578b0 EFLAGS: 00010282
[16402.252787] RAX: 0000000000000025 RBX: 0000000000000002 RCX: ffff9132bbd19388
[16402.253278] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9132bbd19380
[16402.254063] RBP: ffff9132b41a49c0 R08: 0000000000000000 R09: 0000000000000000
[16402.254887] R10: 0000000000000000 R11: ffff91324758b080 R12: ffff91326ef17ce0
[16402.255694] R13: ffff91325fc0f000 R14: ffff91326ef176b0 R15: ffff9132815e2000
[16402.256321] FS: 00007f542c6d7b80(0000) GS:ffff9132bbd00000(0000) knlGS:0000000000000000
[16402.256973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16402.257374] CR2: 00007f127b83f250 CR3: 0000000133480002 CR4: 0000000000370ee0
[16402.257867] Call Trace:
[16402.258072] btrfs_cow_block+0x109/0x230
[16402.258356] btrfs_search_slot+0x530/0x9d0
[16402.258655] btrfs_lookup_file_extent+0x37/0x40
[16402.259155] __btrfs_drop_extents+0x13c/0xd60
[16402.259628] ? btrfs_block_rsv_migrate+0x4f/0xb0
[16402.259949] btrfs_replace_file_extents+0x190/0x820
[16402.260873] btrfs_clone+0x9ae/0xc00
[16402.261139] btrfs_extent_same_range+0x66/0x90
[16402.261771] btrfs_remap_file_range+0x353/0x3b1
[16402.262333] vfs_dedupe_file_range_one.part.0+0xd5/0x140
[16402.262821] vfs_dedupe_file_range+0x189/0x220
[16402.263150] do_vfs_ioctl+0x552/0x700
[16402.263662] __x64_sys_ioctl+0x62/0xb0
[16402.264023] do_syscall_64+0x33/0x40
[16402.264364] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[16402.264862] RIP: 0033:0x7f542c7d15cb
[16402.266901] RSP: 002b:00007ffd35944ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[16402.267627] RAX: ffffffffffffffda RBX: 00000000009d1968 RCX: 00007f542c7d15cb
[16402.268298] RDX: 00000000009d2490 RSI: 00000000c0189436 RDI: 0000000000000003
[16402.268958] RBP: 00000000009d2520 R08: 0000000000000036 R09: 00000000009d2e64
[16402.269726] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
[16402.270659] R13: 000000000001f000 R14: 00000000009d1970 R15: 00000000009d2e80
[16402.271498] irq event stamp: 0
[16402.271846] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[16402.272497] hardirqs last disabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
[16402.273343] softirqs last enabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
[16402.273905] softirqs last disabled at (0): [<0000000000000000>] 0x0
[16402.274338] ---[ end trace 737874a5a41a8236 ]---
[16402.274669] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
[16402.276179] BTRFS info (device dm-9): forced readonly
[16402.277046] BTRFS: error (device dm-9) in btrfs_replace_file_extents:2723: errno=-2 No such entry
[16402.278744] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
[16402.279968] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
[16402.280582] BTRFS info (device dm-9): balance: ended with status: -30
The problem here is that as soon as we allocate the new block it is
locked and marked dirty in the btree inode. This means that we could
attempt to writeback this block and need to lock the extent buffer.
However we're not unlocking it here and thus we deadlock.
Fix this by unlocking the cow block if we have any errors inside of
__btrfs_cow_block, and also free it so we do not leak it.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bb56f02f26fe23798edb1b2175707419b28c752a upstream.
Logging directories with many entries can take a significant amount of
time, and in some cases monopolize a cpu/core for a long time if the
logging task doesn't happen to block often enough.
Johannes and Lu Fengqi reported test case generic/041 triggering a soft
lockup when the kernel has CONFIG_SOFTLOCKUP_DETECTOR=y. For this test
case we log an inode with 3002 hard links, and because the test removed
one hard link before fsyncing the file, the inode logging causes the
parent directory do be logged as well, which has 6004 directory items to
log (3002 BTRFS_DIR_ITEM_KEY items plus 3002 BTRFS_DIR_INDEX_KEY items),
so it can take a significant amount of time and trigger the soft lockup.
So just make tree-log.c:log_dir_items() reschedule when necessary,
releasing the current search path before doing so and then resume from
where it was before the reschedule.
The stack trace produced when the soft lockup happens is the following:
[10480.277653] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [xfs_io:28172]
[10480.279418] Modules linked in: dm_thin_pool dm_persistent_data (...)
[10480.284915] irq event stamp: 29646366
[10480.285987] hardirqs last enabled at (29646365): [<ffffffff85249b66>] __slab_alloc.constprop.0+0x56/0x60
[10480.288482] hardirqs last disabled at (29646366): [<ffffffff8579b00d>] irqentry_enter+0x1d/0x50
[10480.290856] softirqs last enabled at (4612): [<ffffffff85a00323>] __do_softirq+0x323/0x56c
[10480.293615] softirqs last disabled at (4483): [<ffffffff85800dbf>] asm_call_on_stack+0xf/0x20
[10480.296428] CPU: 2 PID: 28172 Comm: xfs_io Not tainted 5.9.0-rc4-default+ #1248
[10480.298948] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[10480.302455] RIP: 0010:__slab_alloc.constprop.0+0x19/0x60
[10480.304151] Code: 86 e8 31 75 21 00 66 66 2e 0f 1f 84 00 00 00 (...)
[10480.309558] RSP: 0018:ffffadbe09397a58 EFLAGS: 00000282
[10480.311179] RAX: ffff8a495ab92840 RBX: 0000000000000282 RCX: 0000000000000006
[10480.313242] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff85249b66
[10480.315260] RBP: ffff8a497d04b740 R08: 0000000000000001 R09: 0000000000000001
[10480.317229] R10: ffff8a497d044800 R11: ffff8a495ab93c40 R12: 0000000000000000
[10480.319169] R13: 0000000000000000 R14: 0000000000000c40 R15: ffffffffc01daf70
[10480.321104] FS: 00007fa1dc5c0e40(0000) GS:ffff8a497da00000(0000) knlGS:0000000000000000
[10480.323559] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10480.325235] CR2: 00007fa1dc5befb8 CR3: 0000000004f8a006 CR4: 0000000000170ea0
[10480.327259] Call Trace:
[10480.328286] ? overwrite_item+0x1f0/0x5a0 [btrfs]
[10480.329784] __kmalloc+0x831/0xa20
[10480.331009] ? btrfs_get_32+0xb0/0x1d0 [btrfs]
[10480.332464] overwrite_item+0x1f0/0x5a0 [btrfs]
[10480.333948] log_dir_items+0x2ee/0x570 [btrfs]
[10480.335413] log_directory_changes+0x82/0xd0 [btrfs]
[10480.336926] btrfs_log_inode+0xc9b/0xda0 [btrfs]
[10480.338374] ? init_once+0x20/0x20 [btrfs]
[10480.339711] btrfs_log_inode_parent+0x8d3/0xd10 [btrfs]
[10480.341257] ? dget_parent+0x97/0x2e0
[10480.342480] btrfs_log_dentry_safe+0x3a/0x50 [btrfs]
[10480.343977] btrfs_sync_file+0x24b/0x5e0 [btrfs]
[10480.345381] do_fsync+0x38/0x70
[10480.346483] __x64_sys_fsync+0x10/0x20
[10480.347703] do_syscall_64+0x2d/0x70
[10480.348891] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10480.350444] RIP: 0033:0x7fa1dc80970b
[10480.351642] Code: 0f 05 48 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 (...)
[10480.356952] RSP: 002b:00007fffb3d081d0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[10480.359458] RAX: ffffffffffffffda RBX: 0000562d93d45e40 RCX: 00007fa1dc80970b
[10480.361426] RDX: 0000562d93d44ab0 RSI: 0000562d93d45e60 RDI: 0000000000000003
[10480.363367] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007fa1dc7b2a40
[10480.365317] R10: 0000562d93d0e366 R11: 0000000000000293 R12: 0000000000000001
[10480.367299] R13: 0000562d93d45290 R14: 0000562d93d45e40 R15: 0000562d93d45e60
Link: https://lore.kernel.org/linux-btrfs/20180713090216.GC575@fnst.localdomain/
Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
CC: stable@vger.kernel.org # 4.4+
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 7c09c03091ac562ddca2b393e5d65c1d37da79f1 ]
Deleting a subvolume on a full filesystem leads to ENOSPC followed by a
forced read-only. This is not a transaction abort and the filesystem is
otherwise ok, so the error should be just propagated to the callers.
This is caused by unnecessary call to btrfs_handle_fs_error for all
errors, except EAGAIN. This does not make sense as the standard
transaction abort mechanism is in btrfs_drop_snapshot so all relevant
failures are handled.
Originally in commit cb1b69f4508a ("Btrfs: forced readonly when
btrfs_drop_snapshot() fails") there was no return value at all, so the
btrfs_std_error made some sense but once the error handling and
propagation has been implemented we don't need it anymore.
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 1c78544eaa4660096aeb6a57ec82b42cdb3bfe5a upstream.
When faulting in the pages for the user supplied buffer for the search
ioctl, we are passing only the base address of the buffer to the function
fault_in_pages_writeable(). This means that after the first iteration of
the while loop that searches for leaves, when we have a non-zero offset,
stored in 'sk_offset', we try to fault in a wrong page range.
So fix this by adding the offset in 'sk_offset' to the base address of the
user supplied buffer when calling fault_in_pages_writeable().
Several users have reported that the applications compsize and bees have
started to operate incorrectly since commit a48b73eca4ceb9 ("btrfs: fix
potential deadlock in the search ioctl") was added to stable trees, and
these applications make heavy use of the search ioctls. This fixes their
issues.
Link: https://lore.kernel.org/linux-btrfs/632b888d-a3c3-b085-cdf5-f9bb61017d92@lechevalier.se/
Link: https://github.com/kilobyte/compsize/issues/34
Fixes: a48b73eca4ceb9 ("btrfs: fix potential deadlock in the search ioctl")
CC: stable@vger.kernel.org # 4.4+
Tested-by: A L <mail@lechevalier.se>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>