IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
It simply gets assigned to 'ret' in case of errors. The flow of the
while loop is not changed by this commit since the few call sites
that 'goto next' will simply break from the loop.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In most cases when an error is returned from a function 'ret' is simply
assigned to 'err'. There is only one case where walk_up_reloc_tree can
return a positive value - in this case the code breaks from the loop and
ret is going to get its return value from btrfs_cow_block - either 0 or
negative. This retains the old logic of how 'err' used to be set at
this call site.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use only a single 'ret' to control whether we should abort the
transaction or not. That's fine, because if we abort a transaction then
btrfs_end_transaction will return the same value as passed to
btrfs_abort_transaction. No semantic changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we are attempting to start writeback for an existing extent in NOCOW
mode, at run_delalloc_nocow(), we must check if the extent is shared, and
if it is, fallback to a COW write. However we do such check while still
holding a read lock on the leaf that contains the file extent item, and
that check, the call to btrfs_cross_ref_exist(), can take some time
because:
1) It needs to do a search on the extent tree, which obviously takes some
time, specially if delayed references are being run at the moment, as
we can block when trying to lock currently write locked btree nodes;
2) It needs to check the delayed references for any existing reference
for our data extent, this requires acquiring the delayed references'
spinlock and maybe block on the mutex of a delayed reference head in the
case where there is a delayed reference for our data extent, in the
worst case it makes us release the path on the extent tree and retry
the whole process again (going back to step 1).
There are other operations we do while holding the leaf locked that can
take some significant time as well (specially all together):
* btrfs_extent_readonly() - to check if the block group containing the
extent is currently in RO mode. This requires taking a spinlock and
searching for the block group in a rbtree that can be big on large
filesystems;
* csum_exist_in_range() - to search if there are any checksums in the
csum tree for the extent. Like before, this can take some time if we are
in a filesystem that has both COW and NOCOW files, in which case the
csum tree is not empty;
* btrfs_inc_nocow_writers() - increment the number of nocow writers in the
block group that contains the data extent. Needs to acquire a spinlock
and search for the block group in a rbtree that can be big on large
filesystems.
So just unlock the leaf (release the path) before doing all those checks,
since we do not need it anymore. In case we can not do a NOCOW write for
the extent, due to any of those checks failing, and the writeback range
goes beyond that extents' length, we will do another btree search for the
next file extent item.
The following script that calls dbench was used to measure the impact of
this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device
directly (no intermediary filesystem on the host) and using a non-debug
kernel (default configuration on Debian):
$ cat test-dbench.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-m single -d single"
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 300 64
umount $MNT
Before this change:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 9326331 0.317 399.957
Close 6851198 0.002 6.402
Rename 394894 2.621 402.819
Unlink 1883131 0.931 398.082
Deltree 256 19.160 303.580
Mkdir 128 0.003 0.016
Qpathinfo 8452314 0.068 116.133
Qfileinfo 1481921 0.001 5.081
Qfsinfo 1549963 0.002 4.444
Sfileinfo 759679 0.084 17.079
Find 3268168 0.396 118.196
WriteX 4653310 0.056 110.993
ReadX 14618818 0.005 23.314
LockX 30364 0.003 0.497
UnlockX 30364 0.002 1.720
Flush 653619 16.954 569.299
Throughput 966.651 MB/sec 64 clients 64 procs max_latency=569.377 ms
After this change:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 9710433 0.302 232.449
Close 7132948 0.002 11.496
Rename 411144 2.452 131.805
Unlink 1960961 0.893 230.383
Deltree 256 14.858 198.646
Mkdir 128 0.002 0.005
Qpathinfo 8800890 0.066 111.588
Qfileinfo 1542556 0.001 3.852
Qfsinfo 1613835 0.002 5.483
Sfileinfo 790871 0.081 19.492
Find 3402743 0.386 120.185
WriteX 4842918 0.054 179.312
ReadX 15220407 0.005 32.435
LockX 31612 0.003 1.533
UnlockX 31612 0.002 1.047
Flush 680567 16.320 463.323
Throughput 1016.59 MB/sec 64 clients 64 procs max_latency=463.327 ms
+5.0% throughput, -20.5% max latency
Also, the following test using fio was run:
$ cat test-fio.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-d single -m single"
if [ $# -ne 4 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE"
exit 1
fi
NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3
BLOCK_SIZE=$4
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=randwrite
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=$BLOCK_SIZE
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
EOF
echo
echo "Using fio config:"
echo
cat /tmp/fio-job.ini
echo
echo "mount options: $MOUNT_OPTIONS"
echo
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo "Creating nodatacow files before fio runs..."
for ((i = 0; i < $NUM_JOBS; i++)); do
xfs_io -f -c "pwrite -b 128M 0 $FILE_SIZE" "$MNT/writers.$i.0"
done
sync
fio /tmp/fio-job.ini
umount $MNT
Before this change:
$ ./test-fio.sh 16 512M 2 4K
(...)
WRITE: bw=28.3MiB/s (29.6MB/s), 28.3MiB/s-28.3MiB/s (29.6MB/s-29.6MB/s), io=8192MiB (8590MB), run=289800-289800msec
After this change:
$ ./test-fio.sh 16 512M 2 4K
(...)
WRITE: bw=31.2MiB/s (32.7MB/s), 31.2MiB/s-31.2MiB/s (32.7MB/s-32.7MB/s), io=8192MiB (8590MB), run=262845-262845msec
+9.7% throughput, -9.8% runtime
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree checker is called many times as it verifies metadata at
read/write time. The checks follow a simple pattern:
if (error_condition) {
report_error();
return -EUCLEAN;
}
All the error reporting functions are annotated as __cold that is
supposed to hint the compiler to move the statement block out of the hot
path. This does not seem to happen that often.
As the error condition is expected to be false almost always, we can
annotate it with 'unlikely' as this satisfies one of the few use cases
for the annotation. The expected outcome is a stronger hint to compiler
to reorder the checks
test
jump to exit
test
jump to exit
...
which can be observed in asm of eg. check_dir_item,
btrfs_check_chunk_valid, check_root_item or check_leaf.
There's a measurable run time improvement reported by Josef, the testing
workload went from 655 MiB/s to 677 MiB/s, which is about +3%.
There should be no functional changes but some of the conditions have
been rewritten to produce more readable result, some lines are longer
than 80, for the sake of readability.
Signed-off-by: David Sterba <dsterba@suse.com>
Without a NULL fs_info the helpers will print something like
BTRFS error (device <unknown>): ...
This can happen in contexts where fs_info is not available at all or
it's potentially unsafe due to object lifetime. The <unknown> stub does
not bring much information and with the prefix makes the message
unnecessarily longer.
Remove it for the NULL fs_info case.
BTRFS error: ...
Callers can add the device information to the message itself if needed.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In alloc_extent_buffer(), after we got a page from btree inode, we check
if that page has private pointer attached.
If attached, we check if the existing extent buffer has proper refs.
If not (the eb is being freed), we will detach that private eb pointer.
The point here is, we are detaching that eb pointer by calling:
- ClearPagePrivate()
- put_page()
The put_page() here is especially confusing, as it's decreasing the ref
from attach_page_private(). Without knowing that, it looks like the
put_page() is for the find_or_create_page() call, confusing the reader.
Since we're always modifying page private with attach_page_private() and
detach_page_private(), the only open-coded detach_page_private() here is
really confusing.
Fix it by calling detach_page_private().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_lookup_bio_sums() if the bio is pretty large, we want to
start readahead in the csum tree.
However the threshold is an immediate number, (PAGE_SIZE * 8), from the
initial btrfs merge.
The meaning of the value is pretty hard to guess, especially when the
immediate number is from the times when 4K sectorsize was the default
and only CRC32C was supported.
For the most common btrfs setup, CRC32 csum and 4K sectorsize,
it means just 32K read would kick readahead, while the csum itself is
only 32 bytes in size.
Now let's be more reasonable by taking both csum size and node size into
consideration.
If the csum size for the bio is larger than one leaf, then we kick the
readahead. This means for current default btrfs, the threshold will be
16M.
This change should not change performance observably, thus this is
mostly a readability enhancement.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_invalidatepage() will try to clear all possible bits since it's
calling clear_extent_bit() with delete == 1.
This is currently fine, since for btree io tree, it only utilizes
EXTENT_LOCK bit. But this could be a problem for later subpage support,
which will utilize extra io tree bit to represent additional info.
This patch will just convert that clear_extent_bit() to
unlock_extent_cached().
For current code since only EXTENT_LOCKED bit is utilized, this doesn't
change the behavior, but provides a much cleaner basis for incoming
subpage support.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Parameter @phy_offset is the offset against the bio->bi_iter.bi_sector.
@phy_offset is mostly for data io to lookup the csum in btrfs_io_bio.
But for metadata, it's completely useless as metadata stores their own
csum in its header, so we can remove it.
Note: parameters @start and @end, they are not utilized at all for
current sectorsize == PAGE_SIZE case, as we can grab eb directly from
page.
But those two parameters are very important for later subpage support,
thus @start/@len are not touched here.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
That anonymous structure serve no special purpose, just replace it with
regular members.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the type is unsigned int which could change its width
depending on the architecture. We need up to 32 bits so make it
explicit.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new helper to handle update page status in
end_bio_extent_readpage(). This will be later used for subpage support
where the page status update can be more complex than now.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In end_bio_extent_readpage() we had a strange dance around
extent_start/extent_len.
Hidden behind the strange dance is, it's just calling
endio_readpage_release_extent() on each bvec range.
Here is an example to explain the original work flow:
Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)
end_bio_extent_extent_readpage() entered
|- extent_start = 0;
|- extent_end = 0;
|- bio_for_each_segment_all() {
| |- /* Got the 1st bvec */
| |- start = SZ_1M;
| |- end = SZ_1M + SZ_4K - 1;
| |- update = 1;
| |- if (extent_len == 0) {
| | |- extent_start = start; /* SZ_1M */
| | |- extent_len = end + 1 - start; /* SZ_1M */
| | }
| |
| |- /* Got the 2nd bvec */
| |- start = SZ_1M + 4K;
| |- end = SZ_1M + 4K - 1;
| |- update = 1;
| |- if (extent_start + extent_len == start) {
| | |- extent_len += end + 1 - start; /* SZ_8K */
| | }
| } /* All bio vec iterated */
|
|- if (extent_len) {
|- endio_readpage_release_extent(tree, extent_start, extent_len,
update);
/* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */
As the above flow shows, the existing code in end_bio_extent_readpage()
is accumulates extent_start/extent_len, and when the contiguous range
stops, calls endio_readpage_release_extent() for the range.
However current behavior has something not really considered:
- The inode can change
For bio, its pages don't need to have contiguous page_offset.
This means, even pages from different inodes can be packed into one
bio.
- bvec cross page boundary
There is a feature called multi-page bvec, where bvec->bv_len can go
beyond bvec->bv_page boundary.
- Poor readability
This patch will address the problem:
- Introduce a proper structure, processed_extent, to record processed
extent range
- Integrate inode/start/end/uptodate check into
endio_readpage_release_extent()
- Add more comment on each step.
This should greatly improve the readability, now in
end_bio_extent_readpage() there are only two
endio_readpage_release_extent() calls.
- Add inode check for contiguity
Now we also ensure the inode is the same one before checking if the
range is contiguous.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In extent-io-test, there are two invalid tests:
- Invalid nodesize for test_eb_bitmaps()
Instead of the sectorsize and nodesize combination passed in, we're
always using hand-crafted nodesize, e.g:
len = (sectorsize < BTRFS_MAX_METADATA_BLOCKSIZE)
? sectorsize * 4 : sectorsize;
In above case, if we have 32K page size, then we will get a length of
128K, which is beyond max node size, and obviously invalid.
The common page size goes up to 64K so we haven't hit that
- Invalid extent buffer bytenr
For 64K page size, the only combination we're going to test is
sectorsize = nodesize = 64K.
However, in that case we will try to test an eb which bytenr is not
sectorsize aligned:
/* Do it over again with an extent buffer which isn't page-aligned. */
eb = __alloc_dummy_extent_buffer(fs_info, nodesize / 2, len);
Sector alignment is a hard requirement for any sector size.
The only exception is superblock. But anything else should follow
sector size alignment.
This is definitely an invalid test case.
This patch will fix both problems by:
- Honor the sectorsize/nodesize combination
Now we won't bother to hand-craft the length and use it as nodesize.
- Use sectorsize as the 2nd run extent buffer start
This would test the case where extent buffer is aligned to sectorsize
but not always aligned to nodesize.
Please note that, later subpage related cleanup will reduce
extent_buffer::pages[] to exactly what we need, making the sector
unaligned extent buffer operations cause problems.
Since only extent_io self tests utilize this, this patch is required for
all later cleanup/refactoring.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A semicolon is not needed after a switch statement.
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is needlessly convoluted. Fix that by:
* removing redundant sret variable definition in both if arms
* replace the again/done labels with direct return statements, the
function is short enough and doesn't do anything special upon exit
* remove BUG_ON on split_node returning a positive number - it can't
happen as split_node returns either 0 or a negative error code.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At the point when we set 'ret = 0' it's guaranteed that the function is
going to return 0 so directly return 0. No functional changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At inode.c:cow_file_range_inline(), after we insert the inline extent
in the fs/subvolume btree, we call btrfs_drop_extent_cache() to drop
all extent maps in the file range, however that is not necessary because
we have already done it in the call to btrfs_drop_extents(), which calls
btrfs_drop_extent_cache() for us, and since at this point we have the file
range locked in the inode's iotree (we are in the writeback path), we know
no other task can come in and read stale file extent items or find none
and therefore create either stale extent maps or an extent map that
represents a hole.
So just remove that unnecessary call to btrfs_drop_extent_cache(), as it's
doing nothing and only wasting time. This call has been around since 2008,
introduced in commit c8b978188c ("Btrfs: Add zlib compression support"),
but even back then it seems it was not necessary, since we had the range
locked in the inode's iotree and the call to btrfs_drop_extents() already
used to always call btrfs_drop_extent_cache().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When joining a log transaction we acquire the root's log mutex, then
increment the root's log batch and log writers counters while holding
the mutex. However we don't need to increment the log batch there,
because we are holding the mutex and incremented the log writers counter
as well, so any other task trying to sync log will wait for the current
task to finish its logging and still achieve the desired log batching.
Since the log batch counter is an atomic counter and is incremented twice
at the very beginning of the fsync callback (btrfs_sync_file()), once
before flushing delalloc and once again after waiting for writeback to
complete, eliminating its increment when joining the log transaction
may provide some performance gains in case we have multiple concurrent
tasks doing fsyncs against different files in the same subvolume, as it
reduces contention on the atomic (locking the cacheline and bouncing it).
When testing fio with 32 jobs, on a 8 cores VM, doing fsyncs against
different files of the same subvolume, on top of a zram device, I could
consistently see gains (higher throughput) between 1% to 2%, which is a
very low value and possibly hard to be observed with a real device (I
couldn't observe consistent gains with my low/mid end NVMe device).
So this change is mostly motivated to just simplify the logic, as updating
the log batch counter is only relevant when an fsync starts and while not
holding the root's log mutex.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Every time we log an inode we lookup in the fs/subvol tree for xattrs and
if we have any, log them into the log tree. However it is very common to
have inodes without any xattrs, so doing the search wastes times, but more
importantly it adds contention on the fs/subvol tree locks, either making
the logging code block and wait for tree locks or making the logging code
making other concurrent operations block and wait.
The most typical use cases where xattrs are used are when capabilities or
ACLs are defined for an inode, or when SELinux is enabled.
This change makes the logging code detect when an inode does not have
xattrs and skip the xattrs search the next time the inode is logged,
unless the inode is evicted and loaded again or a xattr is added to the
inode. Therefore skipping the search for xattrs on inodes that don't ever
have xattrs and are fsynced with some frequency.
The following script that calls dbench was used to measure the impact of
this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device
directly (no intermediary filesystem on the host) and using a non-debug
kernel (default configuration on Debian distributions):
$ cat test.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
mkfs.btrfs -f -m single -d single $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 200 40
umount $MNT
The results before this change:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 5761605 0.172 312.057
Close 4232452 0.002 10.927
Rename 243937 1.406 277.344
Unlink 1163456 0.631 298.402
Deltree 160 11.581 221.107
Mkdir 80 0.003 0.005
Qpathinfo 5221410 0.065 122.309
Qfileinfo 915432 0.001 3.333
Qfsinfo 957555 0.003 3.992
Sfileinfo 469244 0.023 20.494
Find 2018865 0.448 123.659
WriteX 2874851 0.049 118.529
ReadX 9030579 0.004 21.654
LockX 18754 0.003 4.423
UnlockX 18754 0.002 0.331
Flush 403792 10.944 359.494
Throughput 908.444 MB/sec 40 clients 40 procs max_latency=359.500 ms
The results after this change:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 6442521 0.159 230.693
Close 4732357 0.002 10.972
Rename 272809 1.293 227.398
Unlink 1301059 0.563 218.500
Deltree 160 7.796 54.887
Mkdir 80 0.008 0.478
Qpathinfo 5839452 0.047 124.330
Qfileinfo 1023199 0.001 4.996
Qfsinfo 1070760 0.003 5.709
Sfileinfo 524790 0.033 21.765
Find 2257658 0.314 125.611
WriteX 3211520 0.040 232.135
ReadX 10098969 0.004 25.340
LockX 20974 0.003 1.569
UnlockX 20974 0.002 3.475
Flush 451553 10.287 331.037
Throughput 1011.77 MB/sec 40 clients 40 procs max_latency=331.045 ms
+10.8% throughput, -8.2% max latency
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are only 2 direct calls to set_extent_bit outside of extent-io -
in btrfs_find_new_delalloc_bytes and btrfs_truncate_block, the rest are
thin wrappers around __set_extent_bit. This adds unnecessary indirection
and just makes it more annoying when looking at the various extent bit
manipulation functions. This patch renames __set_extent_bit to
set_extent_bit effectively removing a level of indirection. No
functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat and remove __must_check ]
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is unused everywhere now, it can be removed.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is completely unused now, remove it.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We no longer use recursion, so
__btrfs_tree_read_lock(BTRFS_NESTING_NORMAL) == btrfs_tree_read_lock.
Replace this call with the simple helper.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We no longer have recursive locking and there's no need for separate
helpers that allowed the transition to rwsem with minimal code changes.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're no longer using recursion, rip out all of the supporting
code. Follow up patches will clean up the callers of these functions.
The extent_buffer::lock_owner is still retained as it allows safety
checks in btrfs_init_new_buffer for the case that the free space cache
is corrupted and we try to allocate a block that we are currently using
and have locked in the path.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With my async free space cache loading patches ("btrfs: load free space
cache asynchronously") we no longer have a user of path->recurse and can
remove it.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe reported the following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
5.10.0-rc2-btrfs-next-71 #1 Not tainted
------------------------------------------------------
find/324157 is trying to acquire lock:
ffff8ebc48d293a0 (btrfs-tree-01#2/3){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
but task is already holding lock:
ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (btrfs-tree-00){++++}-{3:3}:
lock_acquire+0xd8/0x490
down_write_nested+0x44/0x120
__btrfs_tree_lock+0x27/0x120 [btrfs]
btrfs_search_slot+0x2a3/0xc50 [btrfs]
btrfs_insert_empty_items+0x58/0xa0 [btrfs]
insert_with_overflow+0x44/0x110 [btrfs]
btrfs_insert_xattr_item+0xb8/0x1d0 [btrfs]
btrfs_setxattr+0xd6/0x4c0 [btrfs]
btrfs_setxattr_trans+0x68/0x100 [btrfs]
__vfs_setxattr+0x66/0x80
__vfs_setxattr_noperm+0x70/0x200
vfs_setxattr+0x6b/0x120
setxattr+0x125/0x240
path_setxattr+0xba/0xd0
__x64_sys_setxattr+0x27/0x30
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (btrfs-tree-01#2/3){++++}-{3:3}:
check_prev_add+0x91/0xc60
__lock_acquire+0x1689/0x3130
lock_acquire+0xd8/0x490
down_read_nested+0x45/0x220
__btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
btrfs_next_old_leaf+0x27d/0x580 [btrfs]
btrfs_real_readdir+0x1e3/0x4b0 [btrfs]
iterate_dir+0x170/0x1c0
__x64_sys_getdents64+0x83/0x140
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-tree-00);
lock(btrfs-tree-01#2/3);
lock(btrfs-tree-00);
lock(btrfs-tree-01#2/3);
*** DEADLOCK ***
5 locks held by find/324157:
#0: ffff8ebc502c6e00 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
#1: ffff8eb97f689980 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: iterate_dir+0x52/0x1c0
#2: ffff8ebaec00ca58 (btrfs-tree-02#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
#3: ffff8eb98f986f78 (btrfs-tree-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
#4: ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
stack backtrace:
CPU: 2 PID: 324157 Comm: find Not tainted 5.10.0-rc2-btrfs-next-71 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
check_noncircular+0xff/0x110
? mark_lock.part.0+0x468/0xe90
check_prev_add+0x91/0xc60
__lock_acquire+0x1689/0x3130
? kvm_clock_read+0x14/0x30
? kvm_sched_clock_read+0x5/0x10
lock_acquire+0xd8/0x490
? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
down_read_nested+0x45/0x220
? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
__btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
btrfs_next_old_leaf+0x27d/0x580 [btrfs]
btrfs_real_readdir+0x1e3/0x4b0 [btrfs]
iterate_dir+0x170/0x1c0
__x64_sys_getdents64+0x83/0x140
? filldir+0x1d0/0x1d0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because btrfs_next_old_leaf searches down to our current
key, and then walks up the path until we can move to the next slot, and
then reads back down the path so we get the next leaf.
However it doesn't unlock any lower levels until it replaces them with
the new extent buffer. This is technically fine, but of course causes
lockdep to complain, because we could be holding locks on lower levels
while locking upper levels.
Fix this by dropping all nodes below the level that we use as our new
starting point before we start reading back down the path. This also
allows us to drop the nested/recursive locking magic, because we're no
longer locking two nodes at the same level anymore.
Reported-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are carrying around this next_rw_lock from when we would do spinning
vs blocking read locks. Now that we have the rwsem locking we can
simply use the read lock flag unconditionally and the read lock helpers.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 343694eee8d8 ("btrfs: switch seed device to list api"), missed to
check if the parameter seed is true in the function btrfs_find_device().
This tells it whether to traverse the seed device list or not.
After this commit, the argument is unused and can be removed.
In device_list_add() it's not necessary because fs_devices always points
to the device's fs_devices. So with the devid+uuid matching, it will
find the right device and return, thus not needing to traverse seed
devices.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Drop the condition in verify_one_dev_extent,
btrfs_device::disk_total_bytes is set even for a seed device. The
comment is wrong, the size is properly set when cloning the device.
Commit 1b3922a8bc ("btrfs: Use real device structure to verify
dev extent") introduced it but it's unclear why the total_disk_bytes
was 0.
Theoretically, all devices (including missing and seed) marked with the
BTRFS_DEV_STATE_IN_FS_METADATA flag gets the total_disk_bytes updated at
fill_device_from_item():
open_ctree()
btrfs_read_chunk_tree()
read_one_dev()
open_seed_device()
fill_device_from_item()
Even if verify_one_dev_extent() reports total_disk_bytes == 0, then its
a bug to be fixed somewhere else and not in verify_one_dev_extent() as
it's just a messenger. It is never expected that a total_disk_bytes
shall be zero.
The function fill_device_from_item() does the job of reading it from the
item and updating btrfs_device::disk_total_bytes. So both the missing
device and the seed devices do have their disk_total_bytes updated.
btrfs_find_device can also return a device from fs_info->seed_list
because it searches it as well.
Furthermore, while removing the device if there is a power loss, we
could have a device with its total_bytes = 0, that's still valid.
Instead, introduce a check against maximum block device size in
read_one_dev().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit cf89af146b ("btrfs: dev-replace: fail mount if we don't have
replace item with target device") dropped the multi stage operation of
btrfs_free_extra_devids() that does not need to check replace target
anymore and we can remove the 'step' argument.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several occasions where we do not update the inode's number of
used bytes atomically, resulting in a concurrent stat(2) syscall to report
a value of used blocks that does not correspond to a valid value, that is,
a value that does not match neither what we had before the operation nor
what we get after the operation completes.
In extreme cases it can result in stat(2) reporting zero used blocks, which
can cause problems for some userspace tools where they can consider a file
with a non-zero size and zero used blocks as completely sparse and skip
reading data, as reported/discussed a long time ago in some threads like
the following:
https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
The cases where this can happen are the following:
-> Case 1
If we do a write (buffered or direct IO) against a file region for which
there is already an allocated extent (or multiple extents), then we have a
short time window where we can report a number of used blocks to stat(2)
that does not take into account the file region being overwritten. This
short time window happens when completing the ordered extent(s).
This happens because when we drop the extents in the write range we
decrement the inode's number of bytes and later on when we insert the new
extent(s) we increment the number of bytes in the inode, resulting in a
short time window where a stat(2) syscall can get an incorrect number of
used blocks.
If we do writes that overwrite an entire file, then we have a short time
window where we report 0 used blocks to stat(2).
Example reproducer:
$ cat reproducer-1.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got
while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
expected=$(stat -c %b $MNT/foobar)
# Create a process to keep calling stat(2) on the file and see if the
# reported number of blocks used (disk space used) changes, it should
# not because we are not increasing the file size nor punching holes.
stat_loop $MNT/foobar $expected &
loop_pid=$!
for ((i = 0; i < 50000; i++)); do
xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
done
kill $loop_pid &> /dev/null
wait
umount $DEV
$ ./reproducer-1.sh
ERROR: unexpected used blocks (got: 0 expected: 128)
ERROR: unexpected used blocks (got: 0 expected: 128)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 2
If we do a buffered write against a file region that does not have any
allocated extents, like a hole or beyond EOF, then during ordered extent
completion we have a short time window where a concurrent stat(2) syscall
can report a number of used blocks that does not correspond to the value
before or after the write operation, a value that is actually larger than
the value after the write completes.
This happens because once we start a buffered write into an unallocated
file range we increment the inode's 'new_delalloc_bytes', to make sure
any stat(2) call gets a correct used blocks value before delalloc is
flushed and completes. However at ordered extent completion, after we
inserted the new extent, we increment the inode's number of bytes used
with the size of the new extent, and only later, when clearing the range
in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
counter with the size of the extent. So this results in a short time
window where a concurrent stat(2) syscall can report a number of used
blocks that accounts for the new extent twice.
Example reproducer:
$ cat reproducer-2.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got
while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
touch $MNT/foobar
write_size=$((64 * 1024))
for ((i = 0; i < 16384; i++)); do
offset=$(($i * $write_size))
xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
blocks_used=$(stat -c %b $MNT/foobar)
# Fsync the file to trigger writeback and keep calling stat(2) on it
# to see if the number of blocks used changes.
stat_loop $MNT/foobar $blocks_used &
loop_pid=$!
xfs_io -c "fsync" $MNT/foobar
kill $loop_pid &> /dev/null
wait $loop_pid
done
umount $DEV
$ ./reproducer-2.sh
ERROR: unexpected used blocks (got: 265472 expected: 265344)
ERROR: unexpected used blocks (got: 284032 expected: 283904)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 3
Another case where such problems happen is during other operations that
replace extents in a file range with other extents. Those operations are
extent cloning, deduplication and fallocate's zero range operation.
The cause of the problem is similar to the first case. When we drop the
extents from a range, we decrement the inode's number of bytes, and later
on, after inserting the new extents we increment it. Since this is not
done atomically, a concurrent stat(2) call can see and return a number of
used blocks that is smaller than it should be, does not match the number
of used blocks before or after the clone/deduplication/zero operation.
Like for the first case, when doing a clone, deduplication or zero range
operation against an entire file, we end up having a time window where we
can report 0 used blocks to a stat(2) call.
Example reproducer:
$ cat reproducer-3.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f -m reflink=1 $DEV > /dev/null
mount $DEV $MNT
extent_size=$((64 * 1024))
num_extents=16384
file_size=$(($extent_size * $num_extents))
# File foo has many small extents.
xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
> /dev/null
# File bar has much less extents and has exactly the same data as foo.
xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
expected=$(stat -c %b $MNT/foo)
# Now deduplicate bar into foo. While the deduplication is in progres,
# the number of used blocks/file size reported by stat should not change
xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null &
dedupe_pid=$!
while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
used=$(stat -c %b $MNT/foo)
if [ $used -ne $expected ]; then
echo "Unexpected blocks used: $used (expected: $expected)"
fi
done
umount $DEV
$ ./reproducer-3.sh
Unexpected blocks used: 2076800 (expected: 2097152)
Unexpected blocks used: 2097024 (expected: 2097152)
Unexpected blocks used: 2079872 (expected: 2097152)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
So fix this by:
1) Making btrfs_drop_extents() not decrement the VFS inode's number of
bytes, and instead return the number of bytes;
2) Making any code that drops extents and adds new extents update the
inode's number of bytes atomically, while holding the btrfs inode's
spinlock, which is also used by the stat(2) callback to get the inode's
number of bytes;
3) For ranges in the inode's iotree that are marked as 'delalloc new',
corresponding to previously unallocated ranges, increment the inode's
number of bytes when clearing the 'delalloc new' bit from the range,
in the same critical section that decrements the inode's
'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
An alternative would be to have btrfs_getattr() wait for any IO (ordered
extents in progress) and locking the whole range (0 to (u64)-1) while it
it computes the number of blocks used. But that would mean blocking
stat(2), which is a very used syscall and expected to be fast, waiting
for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When defragmenting we skip ranges that have holes or inline extents, so that
we don't do unnecessary IO and waste space. We do this check when calling
should_defrag_range() at btrfs_defrag_file(). However we do it without
holding the inode's lock. The reason we do it like this is to avoid
blocking other tasks for too long, that possibly want to operate on other
file ranges, since after the call to should_defrag_range() and before
locking the inode, we trigger a synchronous page cache readahead. However
before we were able to lock the inode, some other task might have punched
a hole in our range, or we may now have an inline extent there, in which
case we should not set the range for defrag anymore since that would cause
unnecessary IO and make us waste space (i.e. allocating extents to contain
zeros for a hole).
So after we locked the inode and the range in the iotree, check again if
we have holes or an inline extent, and if we do, just skip the range.
I hit this while testing my next patch that fixes races when updating an
inode's number of bytes (subject "btrfs: update the number of bytes used
by an inode atomically"), and it depends on this change in order to work
correctly. Alternatively I could rework that other patch to detect holes
and flag their range with the 'new delalloc' bit, but this itself fixes
an efficiency problem due a race that from a functional point of view is
not harmful (it could be triggered with btrfs/062 from fstests).
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>