2018-04-03 20:16:55 +03:00
/* SPDX-License-Identifier: GPL-2.0 */
2008-01-08 23:46:30 +03:00
/*
* Copyright ( C ) 2007 Oracle . All rights reserved .
*/
2018-04-03 20:16:55 +03:00
# ifndef BTRFS_ORDERED_DATA_H
# define BTRFS_ORDERED_DATA_H
2008-01-08 23:46:30 +03:00
2008-07-17 21:53:27 +04:00
/* one of these per inode */
2008-01-08 23:46:30 +03:00
struct btrfs_ordered_inode_tree {
2010-02-03 00:48:28 +03:00
spinlock_t lock ;
2008-01-08 23:46:30 +03:00
struct rb_root tree ;
2008-07-17 20:53:50 +04:00
struct rb_node * last ;
2008-01-08 23:46:30 +03:00
} ;
2008-07-17 20:53:50 +04:00
struct btrfs_ordered_sum {
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have touch
the subvolume and inodes as we cache extents.
* There is potentitally one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by phyiscal extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-09 00:58:54 +03:00
/* bytenr is the start of this extent on disk */
u64 bytenr ;
2008-07-18 14:17:13 +04:00
/*
* this is the length in bytes covered by the sums array below .
*/
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 06:36:09 +04:00
int len ;
2008-07-17 20:53:50 +04:00
struct list_head list ;
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 06:36:09 +04:00
/* last field is a variable length array of csums */
2019-05-22 11:19:01 +03:00
u8 sums [ ] ;
2008-07-17 20:53:50 +04:00
} ;
2008-07-17 21:53:27 +04:00
/*
2021-01-21 09:13:54 +03:00
* Bits for btrfs_ordered_extent : : flags .
2008-07-17 21:53:27 +04:00
*
* BTRFS_ORDERED_IO_DONE is set when all of the blocks are written .
* It is used to make sure metadata is inserted into the tree only once
* per extent .
*
* BTRFS_ORDERED_COMPLETE is set when the extent is removed from the
* rbtree , just before waking any waiters . It is used to indicate the
* IO is done and any metadata is inserted into the tree .
*/
2018-11-27 17:17:47 +03:00
enum {
2021-01-21 09:13:54 +03:00
/*
2021-02-11 11:14:05 +03:00
* Different types for ordered extents , one and only one of the 4 types
* need to be set when creating ordered extent .
2021-01-21 09:13:54 +03:00
*
* REGULAR : For regular non - compressed COW write
* NOCOW : For NOCOW write into existing non - hole extent
* PREALLOC : For NOCOW write into preallocated extent
* COMPRESSED : For compressed COW write
*/
BTRFS_ORDERED_REGULAR ,
BTRFS_ORDERED_NOCOW ,
BTRFS_ORDERED_PREALLOC ,
BTRFS_ORDERED_COMPRESSED ,
/*
* Extra bit for direct io , can only be set for
* REGULAR / NOCOW / PREALLOC . No direct io for compressed extent .
*/
BTRFS_ORDERED_DIRECT ,
/* Extra status bits for ordered extents */
2018-11-27 17:17:47 +03:00
/* set when all the pages are written */
BTRFS_ORDERED_IO_DONE ,
/* set when removed from the tree */
BTRFS_ORDERED_COMPLETE ,
/* We had an io error when writing this out */
BTRFS_ORDERED_IOERR ,
/* Set when we have to truncate an extent */
BTRFS_ORDERED_TRUNCATED ,
btrfs: make fast fsyncs wait only for writeback
Currently regardless of a full or a fast fsync we always wait for ordered
extents to complete, and then start logging the inode after that. However
for fast fsyncs we can just wait for the writeback to complete, we don't
need to wait for the ordered extents to complete since we use the list of
modified extents maps to figure out which extents we must log and we can
get their checksums directly from the ordered extents that are still in
flight, otherwise look them up from the checksums tree.
Until commit b5e6c3e170b770 ("btrfs: always wait on ordered extents at
fsync time"), for fast fsyncs, we used to start logging without even
waiting for the writeback to complete first, we would wait for it to
complete after logging, while holding a transaction open, which lead to
performance issues when using cgroups and probably for other cases too,
as wait for IO while holding a transaction handle should be avoided as
much as possible. After that, for fast fsyncs, we started to wait for
ordered extents to complete before starting to log, which adds some
latency to fsyncs and we even got at least one report about a performance
drop which bisected to that particular change:
https://lore.kernel.org/linux-btrfs/20181109215148.GF23260@techsingularity.net/
This change makes fast fsyncs only wait for writeback to finish before
starting to log the inode, instead of waiting for both the writeback to
finish and for the ordered extents to complete. This brings back part of
the logic we had that extracts checksums from in flight ordered extents,
which are not yet in the checksums tree, and making sure transaction
commits wait for the completion of ordered extents previously logged
(by far most of the time they have already completed by the time a
transaction commit starts, resulting in no wait at all), to avoid any
data loss if an ordered extent completes after the transaction used to
log an inode is committed, followed by a power failure.
When there are no other tasks accessing the checksums and the subvolume
btrees, the ordered extent completion is pretty fast, typically taking
100 to 200 microseconds only in my observations. However when there are
other tasks accessing these btrees, ordered extent completion can take a
lot more time due to lock contention on nodes and leaves of these btrees.
I've seen cases over 2 milliseconds, which starts to be significant. In
particular when we do have concurrent fsyncs against different files there
is a lot of contention on the checksums btree, since we have many tasks
writing the checksums into the btree and other tasks that already started
the logging phase are doing lookups for checksums in the btree.
This change also turns all ranged fsyncs into full ranged fsyncs, which
is something we already did when not using the NO_HOLES features or when
doing a full fsync. This is to guarantee we never miss checksums due to
writeback having been triggered only for a part of an extent, and we end
up logging the full extent but only checksums for the written range, which
results in missing checksums after log replay. Allowing ranged fsyncs to
operate again only in the original range, when using the NO_HOLES feature
and doing a fast fsync is doable but requires some non trivial changes to
the writeback path, which can always be worked on later if needed, but I
don't think they are a very common use case.
Several tests were performed using fio for different numbers of concurrent
jobs, each writing and fsyncing its own file, for both sequential and
random file writes. The tests were run on bare metal, no virtualization,
on a box with 12 cores (Intel i7-8700), 64Gb of RAM and a NVMe device,
with a kernel configuration that is the default of typical distributions
(debian in this case), without debug options enabled (kasan, kmemleak,
slub debug, debug of page allocations, lock debugging, etc).
The following script that calls fio was used:
$ cat test-fsync.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/btrfs
MOUNT_OPTIONS="-o ssd -o space_cache=v2"
MKFS_OPTIONS="-d single -m single"
if [ $# -ne 5 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE [write|randwrite]"
exit 1
fi
NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3
BLOCK_SIZE=$4
WRITE_MODE=$5
if [ "$WRITE_MODE" != "write" ] && [ "$WRITE_MODE" != "randwrite" ]; then
echo "Invalid WRITE_MODE, must be 'write' or 'randwrite'"
exit 1
fi
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=$WRITE_MODE
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=$BLOCK_SIZE
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
EOF
echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The results were the following:
*************************
*** sequential writes ***
*************************
==== 1 job, 8GiB file, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=36.6MiB/s (38.4MB/s), 36.6MiB/s-36.6MiB/s (38.4MB/s-38.4MB/s), io=8192MiB (8590MB), run=223689-223689msec
After patch:
WRITE: bw=40.2MiB/s (42.1MB/s), 40.2MiB/s-40.2MiB/s (42.1MB/s-42.1MB/s), io=8192MiB (8590MB), run=203980-203980msec
(+9.8%, -8.8% runtime)
==== 2 jobs, 4GiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=35.8MiB/s (37.5MB/s), 35.8MiB/s-35.8MiB/s (37.5MB/s-37.5MB/s), io=8192MiB (8590MB), run=228950-228950msec
After patch:
WRITE: bw=43.5MiB/s (45.6MB/s), 43.5MiB/s-43.5MiB/s (45.6MB/s-45.6MB/s), io=8192MiB (8590MB), run=188272-188272msec
(+21.5% throughput, -17.8% runtime)
==== 4 jobs, 2GiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163446-163446msec
After patch:
WRITE: bw=64.5MiB/s (67.6MB/s), 64.5MiB/s-64.5MiB/s (67.6MB/s-67.6MB/s), io=8192MiB (8590MB), run=126987-126987msec
(+28.7% throughput, -22.3% runtime)
==== 8 jobs, 1GiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=64.0MiB/s (68.1MB/s), 64.0MiB/s-64.0MiB/s (68.1MB/s-68.1MB/s), io=8192MiB (8590MB), run=126075-126075msec
After patch:
WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=8192MiB (8590MB), run=94358-94358msec
(+35.6% throughput, -25.2% runtime)
==== 16 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=79.8MiB/s (83.6MB/s), 79.8MiB/s-79.8MiB/s (83.6MB/s-83.6MB/s), io=8192MiB (8590MB), run=102694-102694msec
After patch:
WRITE: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=8192MiB (8590MB), run=76446-76446msec
(+34.1% throughput, -25.6% runtime)
==== 32 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=93.2MiB/s (97.7MB/s), 93.2MiB/s-93.2MiB/s (97.7MB/s-97.7MB/s), io=16.0GiB (17.2GB), run=175836-175836msec
After patch:
WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=16.0GiB (17.2GB), run=147001-147001msec
(+19.1% throughput, -16.4% runtime)
==== 64 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=32.0GiB (34.4GB), run=302656-302656msec
After patch:
WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=246003-246003msec
(+23.1% throughput, -18.7% runtime)
************************
*** random writes ***
************************
==== 1 job, 8GiB file, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=8192MiB (8590MB), run=714281-714281msec
After patch:
WRITE: bw=11.6MiB/s (12.2MB/s), 11.6MiB/s-11.6MiB/s (12.2MB/s-12.2MB/s), io=8192MiB (8590MB), run=705959-705959msec
(+0.9% throughput, -1.7% runtime)
==== 2 jobs, 4GiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=8192MiB (8590MB), run=638101-638101msec
After patch:
WRITE: bw=13.1MiB/s (13.7MB/s), 13.1MiB/s-13.1MiB/s (13.7MB/s-13.7MB/s), io=8192MiB (8590MB), run=625374-625374msec
(+2.3% throughput, -2.0% runtime)
==== 4 jobs, 2GiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=15.4MiB/s (16.2MB/s), 15.4MiB/s-15.4MiB/s (16.2MB/s-16.2MB/s), io=8192MiB (8590MB), run=531146-531146msec
After patch:
WRITE: bw=17.8MiB/s (18.7MB/s), 17.8MiB/s-17.8MiB/s (18.7MB/s-18.7MB/s), io=8192MiB (8590MB), run=460431-460431msec
(+15.6% throughput, -13.3% runtime)
==== 8 jobs, 1GiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=19.9MiB/s (20.8MB/s), 19.9MiB/s-19.9MiB/s (20.8MB/s-20.8MB/s), io=8192MiB (8590MB), run=412664-412664msec
After patch:
WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=8192MiB (8590MB), run=368589-368589msec
(+11.6% throughput, -10.7% runtime)
==== 16 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=8192MiB (8590MB), run=279924-279924msec
After patch:
WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269258-269258msec
(+3.8% throughput, -3.8% runtime)
==== 32 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=16.0GiB (17.2GB), run=443581-443581msec
After patch:
WRITE: bw=41.6MiB/s (43.6MB/s), 41.6MiB/s-41.6MiB/s (43.6MB/s-43.6MB/s), io=16.0GiB (17.2GB), run=394114-394114msec
(+12.7% throughput, -11.2% runtime)
==== 64 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s (48.1MB/s-48.1MB/s), io=32.0GiB (34.4GB), run=714614-714614msec
After patch:
WRITE: bw=48.8MiB/s (51.1MB/s), 48.8MiB/s-48.8MiB/s (51.1MB/s-51.1MB/s), io=32.0GiB (34.4GB), run=672087-672087msec
(+6.3% throughput, -6.0% runtime)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-11 14:43:58 +03:00
/* Used during fsync to track already logged extents */
BTRFS_ORDERED_LOGGED ,
/* We have already logged all the csums of the ordered extent */
BTRFS_ORDERED_LOGGED_CSUM ,
/* We wait for this extent to complete in the current transaction */
BTRFS_ORDERED_PENDING ,
2019-08-14 02:00:02 +03:00
/* BTRFS_IOC_ENCODED_WRITE */
BTRFS_ORDERED_ENCODED ,
2018-11-27 17:17:47 +03:00
} ;
2017-02-14 02:35:09 +03:00
2019-11-06 23:11:56 +03:00
/* BTRFS_ORDERED_* flags that specify the type of the extent. */
# define BTRFS_ORDERED_TYPE_FLAGS ((1UL << BTRFS_ORDERED_REGULAR) | \
( 1UL < < BTRFS_ORDERED_NOCOW ) | \
( 1UL < < BTRFS_ORDERED_PREALLOC ) | \
( 1UL < < BTRFS_ORDERED_COMPRESSED ) | \
2019-08-14 02:00:02 +03:00
( 1UL < < BTRFS_ORDERED_DIRECT ) | \
( 1UL < < BTRFS_ORDERED_ENCODED ) )
2019-11-06 23:11:56 +03:00
2008-07-17 20:53:50 +04:00
struct btrfs_ordered_extent {
2008-07-17 21:53:27 +04:00
/* logical offset in the file */
2008-07-17 20:53:50 +04:00
u64 file_offset ;
2008-07-17 21:53:27 +04:00
2019-12-03 04:34:19 +03:00
/*
* These fields directly correspond to the same fields in
* btrfs_file_extent_item .
*/
u64 num_bytes ;
2019-11-06 23:11:56 +03:00
u64 ram_bytes ;
u64 disk_bytenr ;
2019-12-03 04:34:19 +03:00
u64 disk_num_bytes ;
2019-11-06 23:11:56 +03:00
u64 offset ;
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 21:49:59 +03:00
2009-09-03 00:53:46 +04:00
/* number of bytes that still need writing */
u64 bytes_left ;
2012-09-06 14:01:21 +04:00
/*
* the end of the ordered extent which is behind it but
* didn ' t update disk_i_size . Please see the comment of
* btrfs_ordered_update_i_size ( ) ;
*/
u64 outstanding_isize ;
2013-08-29 21:57:21 +04:00
/*
* If we get truncated we need to adjust the file extent we enter for
* this ordered extent so that we do not expose stale data .
*/
u64 truncated_len ;
2008-07-17 21:53:27 +04:00
/* flags (described above) */
2008-07-17 20:53:50 +04:00
unsigned long flags ;
2008-07-17 21:53:27 +04:00
2010-12-17 09:21:50 +03:00
/* compression algorithm */
int compress_type ;
btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak
[BUG]
The following simple workload from fsstress can lead to qgroup reserved
data space leak:
0/0: creat f0 x:0 0 0
0/0: creat add id=0,parent=-1
0/1: write f0[259 1 0 0 0 0] [600030,27288] 0
0/4: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 64 627318] return 25, fallback to stat()
0/4: dwrite f0[259 1 0 0 64 627318] [610304,106496] 0
This would cause btrfs qgroup to leak 20480 bytes for data reserved
space. If btrfs qgroup limit is enabled, such leak can lead to
unexpected early EDQUOT and unusable space.
[CAUSE]
When doing direct IO, kernel will try to writeback existing buffered
page cache, then invalidate them:
generic_file_direct_write()
|- filemap_write_and_wait_range();
|- invalidate_inode_pages2_range();
However for btrfs, the bi_end_io hook doesn't finish all its heavy work
right after bio ends. In fact, it delays its work further:
submit_extent_page(end_io_func=end_bio_extent_writepage);
end_bio_extent_writepage()
|- btrfs_writepage_endio_finish_ordered()
|- btrfs_init_work(finish_ordered_fn);
<<< Work queue execution >>>
finish_ordered_fn()
|- btrfs_finish_ordered_io();
|- Clear qgroup bits
This means, when filemap_write_and_wait_range() returns,
btrfs_finish_ordered_io() is not guaranteed to be executed, thus the
qgroup bits for related range are not cleared.
Now into how the leak happens, this will only focus on the overlapping
part of buffered and direct IO part.
1. After buffered write
The inode had the following range with QGROUP_RESERVED bit:
596 616K
|///////////////|
Qgroup reserved data space: 20K
2. Writeback part for range [596K, 616K)
Write back finished, but btrfs_finish_ordered_io() not get called
yet.
So we still have:
596K 616K
|///////////////|
Qgroup reserved data space: 20K
3. Pages for range [596K, 616K) get released
This will clear all qgroup bits, but don't update the reserved data
space.
So we have:
596K 616K
| |
Qgroup reserved data space: 20K
That number doesn't match the qgroup bit range anymore.
4. Dio prepare space for range [596K, 700K)
Qgroup reserved data space for that range, we got:
596K 616K 700K
|///////////////|///////////////////////|
Qgroup reserved data space: 20K + 104K = 124K
5. btrfs_finish_ordered_range() gets executed for range [596K, 616K)
Qgroup free reserved space for that range, we got:
596K 616K 700K
| |///////////////////////|
We need to free that range of reserved space.
Qgroup reserved data space: 124K - 20K = 104K
6. btrfs_finish_ordered_range() gets executed for range [596K, 700K)
However qgroup bit for range [596K, 616K) is already cleared in
previous step, so we only free 84K for qgroup reserved space.
596K 616K 700K
| | |
We need to free that range of reserved space.
Qgroup reserved data space: 104K - 84K = 20K
Now there is no way to release that 20K unless disabling qgroup or
unmounting the fs.
[FIX]
This patch will change the timing of btrfs_qgroup_release/free_data()
call. Here it uses buffered COW write as an example.
The new timing | The old timing
----------------------------------------+---------------------------------------
btrfs_buffered_write() | btrfs_buffered_write()
|- btrfs_qgroup_reserve_data() | |- btrfs_qgroup_reserve_data()
|
btrfs_run_delalloc_range() | btrfs_run_delalloc_range()
|- btrfs_add_ordered_extent() |
|- btrfs_qgroup_release_data() |
The reserved is passed into |
btrfs_ordered_extent structure |
|
btrfs_finish_ordered_io() | btrfs_finish_ordered_io()
|- The reserved space is passed to | |- btrfs_qgroup_release_data()
btrfs_qgroup_record | The resereved space is passed
| to btrfs_qgroup_recrod
|
btrfs_qgroup_account_extents() | btrfs_qgroup_account_extents()
|- btrfs_qgroup_free_refroot() | |- btrfs_qgroup_free_refroot()
The point of such change is to ensure, when ordered extents are
submitted, the qgroup reserved space is already released, to keep the
timing aligned with file_write_and_wait_range().
So that qgroup data reserved space is all bound to btrfs_ordered_extent
and solve the timing mismatch.
Fixes: f695fdcef83a ("btrfs: qgroup: Introduce functions to release/free qgroup reserve data space")
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-10 04:04:43 +03:00
/* Qgroup reserved space */
int qgroup_rsv ;
2008-07-17 21:53:27 +04:00
/* reference count */
2017-03-03 11:55:13 +03:00
refcount_t refs ;
2008-07-17 21:53:27 +04:00
2008-07-24 19:57:52 +04:00
/* the inode we belong to */
struct inode * inode ;
2008-07-17 21:53:27 +04:00
/* list of checksums for insertion when the extent io is done */
2008-07-17 20:53:50 +04:00
struct list_head list ;
2008-07-17 21:53:27 +04:00
btrfs: make fast fsyncs wait only for writeback
Currently regardless of a full or a fast fsync we always wait for ordered
extents to complete, and then start logging the inode after that. However
for fast fsyncs we can just wait for the writeback to complete, we don't
need to wait for the ordered extents to complete since we use the list of
modified extents maps to figure out which extents we must log and we can
get their checksums directly from the ordered extents that are still in
flight, otherwise look them up from the checksums tree.
Until commit b5e6c3e170b770 ("btrfs: always wait on ordered extents at
fsync time"), for fast fsyncs, we used to start logging without even
waiting for the writeback to complete first, we would wait for it to
complete after logging, while holding a transaction open, which lead to
performance issues when using cgroups and probably for other cases too,
as wait for IO while holding a transaction handle should be avoided as
much as possible. After that, for fast fsyncs, we started to wait for
ordered extents to complete before starting to log, which adds some
latency to fsyncs and we even got at least one report about a performance
drop which bisected to that particular change:
https://lore.kernel.org/linux-btrfs/20181109215148.GF23260@techsingularity.net/
This change makes fast fsyncs only wait for writeback to finish before
starting to log the inode, instead of waiting for both the writeback to
finish and for the ordered extents to complete. This brings back part of
the logic we had that extracts checksums from in flight ordered extents,
which are not yet in the checksums tree, and making sure transaction
commits wait for the completion of ordered extents previously logged
(by far most of the time they have already completed by the time a
transaction commit starts, resulting in no wait at all), to avoid any
data loss if an ordered extent completes after the transaction used to
log an inode is committed, followed by a power failure.
When there are no other tasks accessing the checksums and the subvolume
btrees, the ordered extent completion is pretty fast, typically taking
100 to 200 microseconds only in my observations. However when there are
other tasks accessing these btrees, ordered extent completion can take a
lot more time due to lock contention on nodes and leaves of these btrees.
I've seen cases over 2 milliseconds, which starts to be significant. In
particular when we do have concurrent fsyncs against different files there
is a lot of contention on the checksums btree, since we have many tasks
writing the checksums into the btree and other tasks that already started
the logging phase are doing lookups for checksums in the btree.
This change also turns all ranged fsyncs into full ranged fsyncs, which
is something we already did when not using the NO_HOLES features or when
doing a full fsync. This is to guarantee we never miss checksums due to
writeback having been triggered only for a part of an extent, and we end
up logging the full extent but only checksums for the written range, which
results in missing checksums after log replay. Allowing ranged fsyncs to
operate again only in the original range, when using the NO_HOLES feature
and doing a fast fsync is doable but requires some non trivial changes to
the writeback path, which can always be worked on later if needed, but I
don't think they are a very common use case.
Several tests were performed using fio for different numbers of concurrent
jobs, each writing and fsyncing its own file, for both sequential and
random file writes. The tests were run on bare metal, no virtualization,
on a box with 12 cores (Intel i7-8700), 64Gb of RAM and a NVMe device,
with a kernel configuration that is the default of typical distributions
(debian in this case), without debug options enabled (kasan, kmemleak,
slub debug, debug of page allocations, lock debugging, etc).
The following script that calls fio was used:
$ cat test-fsync.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/btrfs
MOUNT_OPTIONS="-o ssd -o space_cache=v2"
MKFS_OPTIONS="-d single -m single"
if [ $# -ne 5 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE [write|randwrite]"
exit 1
fi
NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3
BLOCK_SIZE=$4
WRITE_MODE=$5
if [ "$WRITE_MODE" != "write" ] && [ "$WRITE_MODE" != "randwrite" ]; then
echo "Invalid WRITE_MODE, must be 'write' or 'randwrite'"
exit 1
fi
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=$WRITE_MODE
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=$BLOCK_SIZE
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
EOF
echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The results were the following:
*************************
*** sequential writes ***
*************************
==== 1 job, 8GiB file, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=36.6MiB/s (38.4MB/s), 36.6MiB/s-36.6MiB/s (38.4MB/s-38.4MB/s), io=8192MiB (8590MB), run=223689-223689msec
After patch:
WRITE: bw=40.2MiB/s (42.1MB/s), 40.2MiB/s-40.2MiB/s (42.1MB/s-42.1MB/s), io=8192MiB (8590MB), run=203980-203980msec
(+9.8%, -8.8% runtime)
==== 2 jobs, 4GiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=35.8MiB/s (37.5MB/s), 35.8MiB/s-35.8MiB/s (37.5MB/s-37.5MB/s), io=8192MiB (8590MB), run=228950-228950msec
After patch:
WRITE: bw=43.5MiB/s (45.6MB/s), 43.5MiB/s-43.5MiB/s (45.6MB/s-45.6MB/s), io=8192MiB (8590MB), run=188272-188272msec
(+21.5% throughput, -17.8% runtime)
==== 4 jobs, 2GiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163446-163446msec
After patch:
WRITE: bw=64.5MiB/s (67.6MB/s), 64.5MiB/s-64.5MiB/s (67.6MB/s-67.6MB/s), io=8192MiB (8590MB), run=126987-126987msec
(+28.7% throughput, -22.3% runtime)
==== 8 jobs, 1GiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=64.0MiB/s (68.1MB/s), 64.0MiB/s-64.0MiB/s (68.1MB/s-68.1MB/s), io=8192MiB (8590MB), run=126075-126075msec
After patch:
WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=8192MiB (8590MB), run=94358-94358msec
(+35.6% throughput, -25.2% runtime)
==== 16 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=79.8MiB/s (83.6MB/s), 79.8MiB/s-79.8MiB/s (83.6MB/s-83.6MB/s), io=8192MiB (8590MB), run=102694-102694msec
After patch:
WRITE: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=8192MiB (8590MB), run=76446-76446msec
(+34.1% throughput, -25.6% runtime)
==== 32 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=93.2MiB/s (97.7MB/s), 93.2MiB/s-93.2MiB/s (97.7MB/s-97.7MB/s), io=16.0GiB (17.2GB), run=175836-175836msec
After patch:
WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=16.0GiB (17.2GB), run=147001-147001msec
(+19.1% throughput, -16.4% runtime)
==== 64 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=32.0GiB (34.4GB), run=302656-302656msec
After patch:
WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=246003-246003msec
(+23.1% throughput, -18.7% runtime)
************************
*** random writes ***
************************
==== 1 job, 8GiB file, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=8192MiB (8590MB), run=714281-714281msec
After patch:
WRITE: bw=11.6MiB/s (12.2MB/s), 11.6MiB/s-11.6MiB/s (12.2MB/s-12.2MB/s), io=8192MiB (8590MB), run=705959-705959msec
(+0.9% throughput, -1.7% runtime)
==== 2 jobs, 4GiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=8192MiB (8590MB), run=638101-638101msec
After patch:
WRITE: bw=13.1MiB/s (13.7MB/s), 13.1MiB/s-13.1MiB/s (13.7MB/s-13.7MB/s), io=8192MiB (8590MB), run=625374-625374msec
(+2.3% throughput, -2.0% runtime)
==== 4 jobs, 2GiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=15.4MiB/s (16.2MB/s), 15.4MiB/s-15.4MiB/s (16.2MB/s-16.2MB/s), io=8192MiB (8590MB), run=531146-531146msec
After patch:
WRITE: bw=17.8MiB/s (18.7MB/s), 17.8MiB/s-17.8MiB/s (18.7MB/s-18.7MB/s), io=8192MiB (8590MB), run=460431-460431msec
(+15.6% throughput, -13.3% runtime)
==== 8 jobs, 1GiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=19.9MiB/s (20.8MB/s), 19.9MiB/s-19.9MiB/s (20.8MB/s-20.8MB/s), io=8192MiB (8590MB), run=412664-412664msec
After patch:
WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=8192MiB (8590MB), run=368589-368589msec
(+11.6% throughput, -10.7% runtime)
==== 16 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=8192MiB (8590MB), run=279924-279924msec
After patch:
WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269258-269258msec
(+3.8% throughput, -3.8% runtime)
==== 32 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=16.0GiB (17.2GB), run=443581-443581msec
After patch:
WRITE: bw=41.6MiB/s (43.6MB/s), 41.6MiB/s-41.6MiB/s (43.6MB/s-43.6MB/s), io=16.0GiB (17.2GB), run=394114-394114msec
(+12.7% throughput, -11.2% runtime)
==== 64 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s (48.1MB/s-48.1MB/s), io=32.0GiB (34.4GB), run=714614-714614msec
After patch:
WRITE: bw=48.8MiB/s (51.1MB/s), 48.8MiB/s-48.8MiB/s (51.1MB/s-51.1MB/s), io=32.0GiB (34.4GB), run=672087-672087msec
(+6.3% throughput, -6.0% runtime)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-11 14:43:58 +03:00
/* used for fast fsyncs */
struct list_head log_list ;
2008-07-17 21:53:27 +04:00
/* used to wait for the BTRFS_ORDERED_COMPLETE bit */
2008-07-17 20:53:50 +04:00
wait_queue_head_t wait ;
2008-07-17 21:53:27 +04:00
/* our friendly rbtree entry */
2008-07-17 20:53:50 +04:00
struct rb_node rb_node ;
2008-07-24 19:57:52 +04:00
/* a per root list of all the pending ordered extents */
struct list_head root_extent_list ;
2012-05-02 22:00:54 +04:00
2014-02-28 06:46:19 +04:00
struct btrfs_work work ;
2008-07-17 20:53:50 +04:00
2012-10-25 13:41:36 +04:00
struct completion completion ;
2014-02-28 06:46:19 +04:00
struct btrfs_work flush_work ;
2012-10-25 13:41:36 +04:00
struct list_head work_list ;
2021-02-04 13:22:05 +03:00
/*
* Used to reverse - map physical address returned from ZONE_APPEND write
* command in a workqueue context
*/
u64 physical ;
2021-07-22 10:53:59 +03:00
struct block_device * bdev ;
2012-10-25 13:41:36 +04:00
} ;
2008-07-17 20:53:50 +04:00
2008-01-08 23:46:30 +03:00
static inline void
btrfs_ordered_inode_tree_init ( struct btrfs_ordered_inode_tree * t )
{
2010-02-03 00:48:28 +03:00
spin_lock_init ( & t - > lock ) ;
2010-02-23 22:43:04 +03:00
t - > tree = RB_ROOT ;
2008-07-17 20:53:50 +04:00
t - > last = NULL ;
2008-01-08 23:46:30 +03:00
}
2022-06-19 09:07:05 +03:00
int btrfs_finish_ordered_io ( struct btrfs_ordered_extent * ordered_extent ) ;
2012-03-01 17:56:26 +04:00
void btrfs_put_ordered_extent ( struct btrfs_ordered_extent * entry ) ;
2020-09-18 12:15:50 +03:00
void btrfs_remove_ordered_extent ( struct btrfs_inode * btrfs_inode ,
2008-07-17 20:53:50 +04:00
struct btrfs_ordered_extent * entry ) ;
btrfs: refactor how we finish ordered extent io for endio functions
Btrfs has two endio functions to mark certain io range finished for
ordered extents:
- __endio_write_update_ordered()
This is for direct IO
- btrfs_writepage_endio_finish_ordered()
This for buffered IO.
However they go different routines to handle ordered extent io:
- Whether to iterate through all ordered extents
__endio_write_update_ordered() will but
btrfs_writepage_endio_finish_ordered() will not.
In fact, iterating through all ordered extents will benefit later
subpage support, while for current PAGE_SIZE == sectorsize requirement
this behavior makes no difference.
- Whether to update page Private2 flag
__endio_write_update_ordered() will not update page Private2 flag as
for iomap direct IO, the page can not be even mapped.
While btrfs_writepage_endio_finish_ordered() will clear Private2 to
prevent double accounting against btrfs_invalidatepage().
Those differences are pretty subtle, and the ordered extent iterations
code in callers makes code much harder to read.
So this patch will introduce a new function,
btrfs_mark_ordered_io_finished(), to do the heavy lifting:
- Iterate through all ordered extents in the range
- Do the ordered extent accounting
- Queue the work for finished ordered extent
This function has two new feature:
- Proper underflow detection and recovery
The old underflow detection will only detect the problem, then
continue.
No proper info like root/inode/ordered extent info, nor noisy enough
to be caught by fstests.
Furthermore when underflow happens, the ordered extent will never
finish.
New error detection will reset the bytes_left to 0, do proper
kernel warning, and output extra info including root, ino, ordered
extent range, the underflow value.
- Prevent double accounting based on Private2 flag
Now if we find a range without Private2 flag, we will skip to next
range.
As that means someone else has already finished the accounting of
ordered extent.
This makes no difference for current code, but will be a critical part
for incoming subpage support, as we can call
btrfs_mark_ordered_io_finished() for multiple sectors if they are
beyond inode size.
Thus such double accounting prevention is a key feature for subpage.
Now both endio functions only need to call that new function.
And since the only caller of btrfs_dec_test_first_ordered_pending() is
removed, also remove btrfs_dec_test_first_ordered_pending() completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-01 10:15:06 +03:00
void btrfs_mark_ordered_io_finished ( struct btrfs_inode * inode ,
struct page * page , u64 file_offset ,
2022-06-19 09:07:05 +03:00
u64 num_bytes , bool uptodate ) ;
2020-12-22 08:59:24 +03:00
bool btrfs_dec_test_ordered_pending ( struct btrfs_inode * inode ,
struct btrfs_ordered_extent * * cached ,
2021-07-26 15:15:10 +03:00
u64 file_offset , u64 io_size ) ;
2020-06-03 08:55:13 +03:00
int btrfs_add_ordered_extent ( struct btrfs_inode * inode , u64 file_offset ,
2019-11-06 23:11:56 +03:00
u64 num_bytes , u64 ram_bytes , u64 disk_bytenr ,
u64 disk_num_bytes , u64 offset , unsigned flags ,
int compress_type ) ;
2019-04-10 16:16:11 +03:00
void btrfs_add_ordered_sum ( struct btrfs_ordered_extent * entry ,
2012-03-01 17:56:26 +04:00
struct btrfs_ordered_sum * sum ) ;
2020-06-03 08:55:03 +03:00
struct btrfs_ordered_extent * btrfs_lookup_ordered_extent ( struct btrfs_inode * inode ,
2008-07-17 20:53:50 +04:00
u64 file_offset ) ;
2020-09-18 12:15:53 +03:00
void btrfs_start_ordered_extent ( struct btrfs_ordered_extent * entry , int wait ) ;
2013-10-26 00:13:35 +04:00
int btrfs_wait_ordered_range ( struct inode * inode , u64 start , u64 len ) ;
2008-07-17 20:53:50 +04:00
struct btrfs_ordered_extent *
2020-08-31 14:42:39 +03:00
btrfs_lookup_first_ordered_extent ( struct btrfs_inode * inode , u64 file_offset ) ;
2021-04-27 10:03:40 +03:00
struct btrfs_ordered_extent * btrfs_lookup_first_ordered_range (
struct btrfs_inode * inode , u64 file_offset , u64 len ) ;
2017-02-20 14:50:49 +03:00
struct btrfs_ordered_extent * btrfs_lookup_ordered_range (
struct btrfs_inode * inode ,
u64 file_offset ,
u64 len ) ;
btrfs: make fast fsyncs wait only for writeback
Currently regardless of a full or a fast fsync we always wait for ordered
extents to complete, and then start logging the inode after that. However
for fast fsyncs we can just wait for the writeback to complete, we don't
need to wait for the ordered extents to complete since we use the list of
modified extents maps to figure out which extents we must log and we can
get their checksums directly from the ordered extents that are still in
flight, otherwise look them up from the checksums tree.
Until commit b5e6c3e170b770 ("btrfs: always wait on ordered extents at
fsync time"), for fast fsyncs, we used to start logging without even
waiting for the writeback to complete first, we would wait for it to
complete after logging, while holding a transaction open, which lead to
performance issues when using cgroups and probably for other cases too,
as wait for IO while holding a transaction handle should be avoided as
much as possible. After that, for fast fsyncs, we started to wait for
ordered extents to complete before starting to log, which adds some
latency to fsyncs and we even got at least one report about a performance
drop which bisected to that particular change:
https://lore.kernel.org/linux-btrfs/20181109215148.GF23260@techsingularity.net/
This change makes fast fsyncs only wait for writeback to finish before
starting to log the inode, instead of waiting for both the writeback to
finish and for the ordered extents to complete. This brings back part of
the logic we had that extracts checksums from in flight ordered extents,
which are not yet in the checksums tree, and making sure transaction
commits wait for the completion of ordered extents previously logged
(by far most of the time they have already completed by the time a
transaction commit starts, resulting in no wait at all), to avoid any
data loss if an ordered extent completes after the transaction used to
log an inode is committed, followed by a power failure.
When there are no other tasks accessing the checksums and the subvolume
btrees, the ordered extent completion is pretty fast, typically taking
100 to 200 microseconds only in my observations. However when there are
other tasks accessing these btrees, ordered extent completion can take a
lot more time due to lock contention on nodes and leaves of these btrees.
I've seen cases over 2 milliseconds, which starts to be significant. In
particular when we do have concurrent fsyncs against different files there
is a lot of contention on the checksums btree, since we have many tasks
writing the checksums into the btree and other tasks that already started
the logging phase are doing lookups for checksums in the btree.
This change also turns all ranged fsyncs into full ranged fsyncs, which
is something we already did when not using the NO_HOLES features or when
doing a full fsync. This is to guarantee we never miss checksums due to
writeback having been triggered only for a part of an extent, and we end
up logging the full extent but only checksums for the written range, which
results in missing checksums after log replay. Allowing ranged fsyncs to
operate again only in the original range, when using the NO_HOLES feature
and doing a fast fsync is doable but requires some non trivial changes to
the writeback path, which can always be worked on later if needed, but I
don't think they are a very common use case.
Several tests were performed using fio for different numbers of concurrent
jobs, each writing and fsyncing its own file, for both sequential and
random file writes. The tests were run on bare metal, no virtualization,
on a box with 12 cores (Intel i7-8700), 64Gb of RAM and a NVMe device,
with a kernel configuration that is the default of typical distributions
(debian in this case), without debug options enabled (kasan, kmemleak,
slub debug, debug of page allocations, lock debugging, etc).
The following script that calls fio was used:
$ cat test-fsync.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/btrfs
MOUNT_OPTIONS="-o ssd -o space_cache=v2"
MKFS_OPTIONS="-d single -m single"
if [ $# -ne 5 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE [write|randwrite]"
exit 1
fi
NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3
BLOCK_SIZE=$4
WRITE_MODE=$5
if [ "$WRITE_MODE" != "write" ] && [ "$WRITE_MODE" != "randwrite" ]; then
echo "Invalid WRITE_MODE, must be 'write' or 'randwrite'"
exit 1
fi
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=$WRITE_MODE
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=$BLOCK_SIZE
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
EOF
echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The results were the following:
*************************
*** sequential writes ***
*************************
==== 1 job, 8GiB file, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=36.6MiB/s (38.4MB/s), 36.6MiB/s-36.6MiB/s (38.4MB/s-38.4MB/s), io=8192MiB (8590MB), run=223689-223689msec
After patch:
WRITE: bw=40.2MiB/s (42.1MB/s), 40.2MiB/s-40.2MiB/s (42.1MB/s-42.1MB/s), io=8192MiB (8590MB), run=203980-203980msec
(+9.8%, -8.8% runtime)
==== 2 jobs, 4GiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=35.8MiB/s (37.5MB/s), 35.8MiB/s-35.8MiB/s (37.5MB/s-37.5MB/s), io=8192MiB (8590MB), run=228950-228950msec
After patch:
WRITE: bw=43.5MiB/s (45.6MB/s), 43.5MiB/s-43.5MiB/s (45.6MB/s-45.6MB/s), io=8192MiB (8590MB), run=188272-188272msec
(+21.5% throughput, -17.8% runtime)
==== 4 jobs, 2GiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163446-163446msec
After patch:
WRITE: bw=64.5MiB/s (67.6MB/s), 64.5MiB/s-64.5MiB/s (67.6MB/s-67.6MB/s), io=8192MiB (8590MB), run=126987-126987msec
(+28.7% throughput, -22.3% runtime)
==== 8 jobs, 1GiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=64.0MiB/s (68.1MB/s), 64.0MiB/s-64.0MiB/s (68.1MB/s-68.1MB/s), io=8192MiB (8590MB), run=126075-126075msec
After patch:
WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=8192MiB (8590MB), run=94358-94358msec
(+35.6% throughput, -25.2% runtime)
==== 16 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=79.8MiB/s (83.6MB/s), 79.8MiB/s-79.8MiB/s (83.6MB/s-83.6MB/s), io=8192MiB (8590MB), run=102694-102694msec
After patch:
WRITE: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=8192MiB (8590MB), run=76446-76446msec
(+34.1% throughput, -25.6% runtime)
==== 32 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=93.2MiB/s (97.7MB/s), 93.2MiB/s-93.2MiB/s (97.7MB/s-97.7MB/s), io=16.0GiB (17.2GB), run=175836-175836msec
After patch:
WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=16.0GiB (17.2GB), run=147001-147001msec
(+19.1% throughput, -16.4% runtime)
==== 64 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=32.0GiB (34.4GB), run=302656-302656msec
After patch:
WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=246003-246003msec
(+23.1% throughput, -18.7% runtime)
************************
*** random writes ***
************************
==== 1 job, 8GiB file, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=8192MiB (8590MB), run=714281-714281msec
After patch:
WRITE: bw=11.6MiB/s (12.2MB/s), 11.6MiB/s-11.6MiB/s (12.2MB/s-12.2MB/s), io=8192MiB (8590MB), run=705959-705959msec
(+0.9% throughput, -1.7% runtime)
==== 2 jobs, 4GiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=8192MiB (8590MB), run=638101-638101msec
After patch:
WRITE: bw=13.1MiB/s (13.7MB/s), 13.1MiB/s-13.1MiB/s (13.7MB/s-13.7MB/s), io=8192MiB (8590MB), run=625374-625374msec
(+2.3% throughput, -2.0% runtime)
==== 4 jobs, 2GiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=15.4MiB/s (16.2MB/s), 15.4MiB/s-15.4MiB/s (16.2MB/s-16.2MB/s), io=8192MiB (8590MB), run=531146-531146msec
After patch:
WRITE: bw=17.8MiB/s (18.7MB/s), 17.8MiB/s-17.8MiB/s (18.7MB/s-18.7MB/s), io=8192MiB (8590MB), run=460431-460431msec
(+15.6% throughput, -13.3% runtime)
==== 8 jobs, 1GiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=19.9MiB/s (20.8MB/s), 19.9MiB/s-19.9MiB/s (20.8MB/s-20.8MB/s), io=8192MiB (8590MB), run=412664-412664msec
After patch:
WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=8192MiB (8590MB), run=368589-368589msec
(+11.6% throughput, -10.7% runtime)
==== 16 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=8192MiB (8590MB), run=279924-279924msec
After patch:
WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269258-269258msec
(+3.8% throughput, -3.8% runtime)
==== 32 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=16.0GiB (17.2GB), run=443581-443581msec
After patch:
WRITE: bw=41.6MiB/s (43.6MB/s), 41.6MiB/s-41.6MiB/s (43.6MB/s-43.6MB/s), io=16.0GiB (17.2GB), run=394114-394114msec
(+12.7% throughput, -11.2% runtime)
==== 64 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s (48.1MB/s-48.1MB/s), io=32.0GiB (34.4GB), run=714614-714614msec
After patch:
WRITE: bw=48.8MiB/s (51.1MB/s), 48.8MiB/s-48.8MiB/s (51.1MB/s-51.1MB/s), io=32.0GiB (34.4GB), run=672087-672087msec
(+6.3% throughput, -6.0% runtime)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-11 14:43:58 +03:00
void btrfs_get_ordered_extents_for_logging ( struct btrfs_inode * inode ,
struct list_head * list ) ;
2017-06-23 19:48:21 +03:00
u64 btrfs_wait_ordered_extents ( struct btrfs_root * root , u64 nr ,
2016-04-26 17:36:38 +03:00
const u64 range_start , const u64 range_len ) ;
Btrfs: fix block group remaining RO forever after error during device replace
When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we
set the block group to RO mode and then wait for any ongoing writes into
extents of the block group to complete. While doing that wait we overwrite
the value of the variable 'ret' and can break out of the loop if an error
happens without turning the block group back into RW mode. So what happens
is the following:
1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group
to RO mode (its ->ro field set to 1 or incremented to some value > 1);
2) Then btrfs_wait_ordered_roots() returns a value > 0;
3) Then if either joining or committing the transaction fails, we break
out of the loop wihtout calling btrfs_dec_block_group_ro(), leaving
the block group in RO mode forever.
To fix this, just remove the code that waits for ongoing writes to extents
of the block group, since it's not needed because in the initial setup
phase of a device replace operation, before starting to find all chunks
and their extents, we set the target device for replace while holding
fs_info->dev_replace->rwsem, which ensures that after releasing that
semaphore, any writes into the source device are made to the target device
as well (__btrfs_map_block() guarantees that). So while at
scrub_enumerate_chunks() we only need to worry about finding and copying
extents (from the source device to the target device) that were written
before we started the device replace operation.
Fixes: f0e9b7d6401959 ("Btrfs: fix race setting block group readonly during device replace")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-14 21:02:43 +03:00
void btrfs_wait_ordered_roots ( struct btrfs_fs_info * fs_info , u64 nr ,
2016-04-26 17:36:38 +03:00
const u64 range_start , const u64 range_len ) ;
2020-02-05 21:09:33 +03:00
void btrfs_lock_and_flush_ordered_range ( struct btrfs_inode * inode , u64 start ,
2019-05-07 10:19:22 +03:00
u64 end ,
struct extent_state * * cached_state ) ;
2022-09-30 23:45:10 +03:00
bool btrfs_try_lock_ordered_range ( struct btrfs_inode * inode , u64 start , u64 end ,
struct extent_state * * cached_state ) ;
2021-02-04 13:22:00 +03:00
int btrfs_split_ordered_extent ( struct btrfs_ordered_extent * ordered , u64 pre ,
u64 post ) ;
2012-09-06 14:01:51 +04:00
int __init ordered_data_init ( void ) ;
2018-02-19 19:24:18 +03:00
void __cold ordered_data_exit ( void ) ;
2018-04-03 20:16:55 +03:00
2008-01-08 23:46:30 +03:00
# endif