iv/linux

Author	SHA1	Message	Date
Christian Brauner	4d4340c912	btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls Creating subvolumes and snapshots is one of the core features of btrfs and is even available to unprivileged users. Make it possible to use subvolume and snapshot creation on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:14 +02:00
Christian Brauner	3bc71ba02c	btrfs: allow idmapped permission inode op Enable btrfs_permission() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:13 +02:00
Christian Brauner	d4d0946461	btrfs: allow idmapped setattr inode op Enable btrfs_setattr() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:13 +02:00
Christian Brauner	98b6ab5fc0	btrfs: allow idmapped tmpfile inode op Enable btrfs_tmpfile() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:13 +02:00
Christian Brauner	5a0521086e	btrfs: allow idmapped symlink inode op Enable btrfs_symlink() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:13 +02:00
Christian Brauner	b0b3e44d34	btrfs: allow idmapped mkdir inode op Enable btrfs_mkdir() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:13 +02:00
Christian Brauner	e93ca491d0	btrfs: allow idmapped create inode op Enable btrfs_create() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:13 +02:00
Christian Brauner	72105277dc	btrfs: allow idmapped mknod inode op Enable btrfs_mknod() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:12 +02:00
Christian Brauner	c020d2eaf1	btrfs: allow idmapped getattr inode op Enable btrfs_getattr() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:12 +02:00
Christian Brauner	ca07274c3d	btrfs: allow idmapped rename inode op Enable btrfs_rename() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:12 +02:00
Christian Brauner	b3b6f5b922	btrfs: handle idmaps in btrfs_new_inode() Extend btrfs_new_inode() to take the idmapped mount into account when initializing a new inode. This is just a matter of passing down the mount's userns. The rest is taken care of in inode_init_owner(). This is a preliminary patch to make the individual btrfs inode operations idmapped mount aware. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:12 +02:00
Naohiro Aota	63fb5879db	btrfs: zoned: add asserts on splitting extent_map We call split_zoned_em() on an extent_map on submitting a bio for it. Thus, we can assume the extent_map is PINNED, not LOGGING, and in the modified list. Add ASSERT()s to ensure the extent_maps after the split also has the proper flags set and are in the modified list. Suggested-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:11 +02:00
Filipe Manana	1c167b87f4	btrfs: remove unnecessary NULL check for the new inode during rename exchange At the very end of btrfs_rename_exchange(), in case an error happened, we are checking if 'new_inode' is NULL, but that is not needed since during a rename exchange, unlike regular renames, 'new_inode' can never be NULL, and if it were, we would have crashed much earlier when we dereference it multiple times. So remove the check because it is not necessary and because it is causing static checkers to emit a warning. I probably introduced the check by copy-pasting similar code from btrfs_rename(), where 'new_inode' can be NULL, in commit 86e8aa0e772cab ("Btrfs: unpin logs if rename exchange operation fails"). Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:10 +02:00
Boris Burkov	705242538f	btrfs: verity metadata orphan items Writing out the verity data is too large of an operation to do in a single transaction. If we are interrupted before we finish creating fsverity metadata for a file, or fail to clean up already created metadata after a failure, we could leak the verity items that we already committed. To address this issue, we use the orphan mechanism. When we start enabling verity on a file, we also add an orphan item for that inode. When we are finished, we delete the orphan. However, if we are interrupted midway, the orphan will be present at mount and we can cleanup the half-formed verity state. There is a possible race with a normal unlink operation: if unlink and verity run on the same file in parallel, it is possible for verity to succeed and delete the still legitimate orphan added by unlink. Then, if we are interrupted and mount in that state, we will never clean up the inode properly. This is also possible for a file created with O_TMPFILE. Check nlink==0 before deleting to avoid this race. A final thing to note is that this is a resurrection of using orphans to signal an operation besides "delete this inode". The old case was to signal the need to do a truncate. That case still technically applies for mounting very old file systems, so we need to take some care to not clobber it. To that end, we just have to be careful that verity orphan cleanup is a no-op for non-verity files. Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:09 +02:00
Boris Burkov	146054090b	btrfs: initial fsverity support Add support for fsverity in btrfs. To support the generic interface in fs/verity, we add two new item types in the fs tree for inodes with verity enabled. One stores the per-file verity descriptor and btrfs verity item and the other stores the Merkle tree data itself. Verity checking is done in end_page_read just before a page is marked uptodate. This naturally handles a variety of edge cases like holes, preallocated extents, and inline extents. Some care needs to be taken to not try to verify pages past the end of the file, which are accessed by the generic buffered file reading code under some circumstances like reading to the end of the last page and trying to read again. Direct IO on a verity file falls back to buffered reads. Verity relies on PageChecked for the Merkle tree data itself to avoid re-walking up shared paths in the tree. For this reason, we need to cache the Merkle tree data. Since the file is immutable after verity is turned on, we can cache it at an index past EOF. Use the new inode ro_flags to store verity on the inode item, so that we can enable verity on a file, then roll back to an older kernel and still mount the file system and read the file. Since we can't safely write the file anymore without ruining the invariants of the Merkle tree, we mark a ro_compat flag on the file system when a file has verity enabled. Acked-by: Eric Biggers <ebiggers@google.com> Co-developed-by: Chris Mason <clm@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:09 +02:00
Boris Burkov	77eea05e78	btrfs: add ro compat flags to inodes Currently, inode flags are fully backwards incompatible in btrfs. If we introduce a new inode flag, then tree-checker will detect it and fail. This can even cause us to fail to mount entirely. To make it possible to introduce new flags which can be read-only compatible, like VERITY, we add new ro flags to btrfs without treating them quite so harshly in tree-checker. A read-only file system can survive an unexpected flag, and can be mounted. As for the implementation, it unfortunately gets a little complicated. The on-disk representation of the inode, btrfs_inode_item, has an __le64 for flags but the in-memory representation, btrfs_inode, uses a u32. David Sterba had the nice idea that we could reclaim those wasted 32 bits on disk and use them for the new ro_compat flags. It turns out that the tree-checker code which checks for unknown flags is broken, and ignores the upper 32 bits we are hoping to use. The issue is that the flags use the literal 1 rather than 1ULL, so the flags are signed ints, and one of them is specifically (1 << 31). As a result, the mask which ORs the flags is a negative integer on machines where int is 32 bit twos complement. When tree-checker evaluates the expression: btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK) The mask is something like 0x80000abc, which gets promoted to u64 with sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves all the upper bits zeroed, and we can't detect unexpected flags. This suggests that we can't use those bits after all. Luckily, we have good reason to believe that they are zero anyway. Inode flags are metadata, which is always checksummed, so any bit flips that would introduce 1s would cause a checksum failure anyway (excluding the improbable case of the checksum getting corrupted exactly badly). 
Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit inode flag should preserve its value and not add leading zeroes (at least for twos complement). The only place that flag (BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in the root item, and indeed for that inode we see 0xffffffff80000000 as the flags on disk. However, that inode is never seen by tree checker, nor is it used in a context where verity might be meaningful. Theoretically, a future ro flag might cause trouble on that inode, so we should proactively clean up that mess before it does. With the introduction of the new ro flags, keep two separate unsigned masks and check them against the appropriate u32. Since we no longer run afoul of sign extension, this also stops writing out 0xffffffff80000000 in root_item inodes going forward. Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:09 +02:00
Qu Wenruo	7361b4ae03	btrfs: remove the dead comment in writepage_delalloc() When btrfs_run_delalloc_range() fails, we error out. But there is a strange comment mentioning that btrfs_run_delalloc_range() could have returned a value >0 to indicate the IO has already started. Commit 40f765805f08 ("Btrfs: split up __extent_writepage to lower stack usage") introduced the comment, but unfortunately at that time, we were already using @page_started to indicate that case, and still returned 0. Furthermore, even if that comment was right (which it is not), we would return -EIO if the IO had already started. By all means the comment is incorrect, so just remove the comment along with the dead check. Just to be extra safe, add an ASSERT() in btrfs_run_delalloc_range() to make sure we return either 0 or an error, never a positive value. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:08 +02:00
Filipe Manana	bd54f381a1	btrfs: do not pin logs too early during renames During renames we pin the logs of the roots a bit too early, before the calls to btrfs_insert_inode_ref(). We can pin the logs after those calls, since those will not change anything in a log tree. In a scenario where we have multiple and diverse filesystem operations running in parallel, those calls can take a significant amount of time, due to lock contention on extent buffers, and delay log commits from other tasks for longer than necessary. So just pin logs after calls to btrfs_insert_inode_ref() and right before the first operation that can update a log tree. The following script that uses dbench was used for testing:
  $ cat dbench-test.sh
  #!/bin/bash
  DEV=/dev/nvme0n1
  MNT=/mnt/nvme0n1
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-m single -d single"
  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
  umount $DEV &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT
  dbench -D $MNT -t 120 16
  umount $MNT
The tests were run on a machine with 12 cores, 64G of RAM, an NVMe device and using a non-debug kernel config (Debian's default config). The results compare a branch without this patch and without the previous patch in the series, that has the subject: "btrfs: eliminate some false positives when checking if inode was logged" Versus the same branch with these two patches applied. 
dbench with 8 clients, results before:
  Operation      Count    AvgLat    MaxLat
  ----------------------------------------
  NTCreateX    4391359     0.009   249.745
  Close        3225882     0.001     3.243
  Rename        185953     0.065   240.643
  Unlink        886669     0.049   249.906
  Deltree          112     2.455   217.433
  Mkdir             56     0.002     0.004
  Qpathinfo    3980281     0.004     3.109
  Qfileinfo     697579     0.001     0.187
  Qfsinfo       729780     0.002     2.424
  Sfileinfo     357764     0.004     1.415
  Find         1538861     0.016     4.863
  WriteX       2189666     0.010     3.327
  ReadX        6883443     0.002     0.729
  LockX          14298     0.002     0.073
  UnlockX        14298     0.001     0.042
  Flush         307777     2.447   303.663
  Throughput 1149.6 MB/sec 8 clients 8 procs max_latency=303.666 ms
dbench with 8 clients, results after:
  Operation      Count    AvgLat    MaxLat
  ----------------------------------------
  NTCreateX    4269920     0.009   213.532
  Close        3136653     0.001     0.690
  Rename        180805     0.082   213.858
  Unlink        862189     0.050   172.893
  Deltree          112     2.998   218.328
  Mkdir             56     0.002     0.003
  Qpathinfo    3870158     0.004     5.072
  Qfileinfo     678375     0.001     0.194
  Qfsinfo       709604     0.002     0.485
  Sfileinfo     347850     0.004     1.304
  Find         1496310     0.017     5.504
  WriteX       2129613     0.010     2.882
  ReadX        6693066     0.002     1.517
  LockX          13902     0.002     0.075
  UnlockX        13902     0.001     0.055
  Flush         299276     2.511   220.189
  Throughput 1187.33 MB/sec 8 clients 8 procs max_latency=220.194 ms
+3.2% throughput, -31.8% max latency
dbench with 16 clients, results before:
  Operation      Count    AvgLat    MaxLat
  ----------------------------------------
  NTCreateX    5978334     0.028   156.507
  Close        4391598     0.001     1.345
  Rename        253136     0.241   155.057
  Unlink       1207220     0.182   257.344
  Deltree          160     6.123    36.277
  Mkdir             80     0.003     0.005
  Qpathinfo    5418817     0.012     6.867
  Qfileinfo     949929     0.001     0.941
  Qfsinfo       993560     0.002     1.386
  Sfileinfo     486904     0.004     2.829
  Find         2095088     0.059     8.164
  WriteX       2982319     0.017     9.029
  ReadX        9371484     0.002     4.052
  LockX          19470     0.002     0.461
  UnlockX        19470     0.001     0.990
  Flush         418936     2.740   347.902
  Throughput 1495.31 MB/sec 16 clients 16 procs max_latency=347.909 ms
dbench with 16 clients, results after:
  Operation      Count    AvgLat    MaxLat
  ----------------------------------------
  NTCreateX    5711833     0.029   131.240
  Close        4195897     0.001     1.732
  Rename        241849     0.204   147.831
  Unlink       1153341     0.184   231.322
  Deltree          160     6.086    30.198
  Mkdir             80     0.003     0.021
  Qpathinfo    5177011     0.012     7.150
  Qfileinfo     907768     0.001     0.793
  Qfsinfo       949205     0.002     1.431
  Sfileinfo     465317     0.004     2.454
  Find         2001541     0.058     7.819
  WriteX       2850661     0.017     9.110
  ReadX        8952289     0.002     3.991
  LockX          18596     0.002     0.655
  UnlockX        18596     0.001     0.179
  Flush         400342     2.879   293.607
  Throughput 1565.73 MB/sec 16 clients 16 procs max_latency=293.611 ms
+4.6% throughput, -16.9% max latency
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:08 +02:00
Naohiro Aota	42b5d73b5d	btrfs: drop unnecessary ASSERT from btrfs_submit_direct() On a SINGLE block group, btrfs_get_io_geometry() will return "the size of the block group - the offset of the logical address within the block group" as geom.len. Since we allow up to 8 GiB zone size on a zoned filesystem, we can have a block group of up to 8 GiB, and so geom.len can be up to 8 GiB as well. With this setup, we easily hit the "ASSERT(geom.len <= INT_MAX);". The ASSERT appears to guard btrfs_bio_clone_partial() and bio_trim(), which both take "int" (now u64 due to the previous patch). So to be precise the ASSERT should check if clone_len <= UINT_MAX. But actually, clone_len is already capped by bio.bi_iter.bi_size, which is unsigned int. So the ASSERT is not necessary. Drop the ASSERT and properly compare submit_len and geom.len in u64. Then let the implicit cast convert it to u64. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:08 +02:00
Josef Bacik	b377630527	btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking sync_inode() has some holes that can cause problems if we're under heavy ENOSPC pressure. If there's writeback running on a separate thread sync_inode() will skip writing the inode altogether. What we really want is to make sure writeback has been started on all the pages to make sure we can see the ordered extents and wait on them if appropriate. Switch to this new helper which will allow us to accomplish this and avoid ENOSPC'ing early. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:07 +02:00
Josef Bacik	e16460707e	btrfs: wait on async extents when flushing delalloc I've been debugging an early ENOSPC problem in production and finally root caused it to this problem. When we switched to the per-inode in 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc") I pulled out the async extent handling, because we were doing the correct thing by calling filemap_flush() if we had async extents set. This would properly wait on any async extents by locking the page in the second flush, thus making sure our ordered extents were properly set up. However when I switched us back to page based flushing, I used sync_inode(), which allows us to pass in our own wbc. The problem here is that sync_inode() is smarter than the filemap_* helpers, it tries to avoid calling writepages at all. This means that our second call could skip calling do_writepages altogether, and thus not wait on the pagelock for the async helpers. This means we could come back before any ordered extents were created and then simply continue on in our flushing mechanisms and ENOSPC out when we have plenty of space to use. Fix this by putting back the async pages logic in shrink_delalloc. This allows us to bulk write out everything that we need to, and then we can wait in one place for the async helpers to catch up, and then wait on any ordered extents that are created. Fixes: e076ab2a2ca7 ("btrfs: shrink delalloc pages instead of full inodes") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:07 +02:00
Josef Bacik	ac98141d14	btrfs: wake up async_delalloc_pages waiters after submit We use the async_delalloc_pages mechanism to make sure that we've completed our async work before trying to continue our delalloc flushing. The reason for this is we need to see any ordered extents that were created by our delalloc flushing. However we're waking up before we do the submit work, which is before we create the ordered extents. This is a pretty wide race window where we could potentially think there are no ordered extents and thus exit shrink_delalloc prematurely. Fix this by waking us up after we've done the work to create ordered extents. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:06 +02:00
Qu Wenruo	95ea0486b2	btrfs: allow read-write for 4K sectorsize on 64K page size systems Since we now support data and metadata read-write for subpage, remove the RO requirement for subpage mount. There are some extra limitations though:
- For now, subpage RW mount is still considered experimental. Thus that mount warning will still be there.
- No compression support. There are still quite a few hard-coded PAGE_SIZE uses, and quite a few call sites use extent_clear_unlock_delalloc() to unlock locked_page. This will screw up subpage helpers. Now for subpage RW mount, no matter what mount option or inode attr is set, no writes will be compressed. Reading compressed data has no problem, though.
- No defrag for subpage case. The defrag support for subpage case will come in later patches, which will also rework the defrag workflow.
- No inline extent will be created. This is mostly due to the fact that filemap_fdatawrite_range() will trigger more writes than the range specified. In fallocate calls, this behavior can make us write back data which can be inlined, before we enlarge the i_size. This is a very special corner case, and even current btrfs check won't report an error on such inline extent + regular extent. But considering how much effort has been put into preventing such inline + regular extents, I'd prefer to cut off inline extents completely until we have a good solution.
Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:06 +02:00
Qu Wenruo	e3c62324e4	btrfs: subpage: fix false alert when relocating partial preallocated data extents [BUG] When relocating partial preallocated data extents (part of the preallocated extent is written) for subpage, it can cause the following false alert and make the relocation fail:
  BTRFS info (device dm-3): balance: start -d
  BTRFS info (device dm-3): relocating block group 13631488 flags data
  BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
  BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
  BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  BTRFS info (device dm-3): balance: ended with status: -5
The minimal script to reproduce looks like this:
  mkfs.btrfs -f -s 4k $dev
  mount $dev -o nospace_cache $mnt
  xfs_io -f -c "falloc 0 8k" $mnt/file
  xfs_io -f -c "pwrite 0 4k" $mnt/file
  btrfs balance start -d $mnt
[CAUSE] Function btrfs_verify_data_csum() checks if the full range has the EXTENT_NODATASUM bit for the data reloc inode; if all bytes of the range have the EXTENT_NODATASUM bit, then it skips the range. This works pretty well for regular sectorsize, as in that case btrfs_verify_data_csum() is called for each sector, thus no problem at all. But for the subpage case, btrfs_verify_data_csum() is called on each bvec, which can contain several sectors, and since it checks all bytes for the EXTENT_NODATASUM bit, if we have some range with csum, then we will continue checking all the sectors. The preallocated sectors don't have any csum, thus obviously the csum won't match, causing the false alert. [FIX] Move the EXTENT_NODATASUM check into the main loop, so that we can check each sector for the EXTENT_NODATASUM bit in the subpage case. 
Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:05 +02:00
Qu Wenruo	7c11d0ae43	btrfs: subpage: fix a potential use-after-free in writeback helper [BUG] There is a possible use-after-free bug when running generic/095.
  BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
  Faulting instruction address: 0xc000000000283654
  c000000000283078 do_raw_spin_unlock+0x88/0x230
  c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
  c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
  c0000000009e0458 end_bio_extent_writepage+0x158/0x270
  c000000000b6fd14 bio_endio+0x254/0x270
  c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
  c000000000b6fd14 bio_endio+0x254/0x270
  c000000000b781fc blk_update_request+0x46c/0x670
  c000000000b8b394 blk_mq_end_request+0x34/0x1d0
  c000000000d82d1c lo_complete_rq+0x11c/0x140
  c000000000b880a4 blk_complete_reqs+0x84/0xb0
  c0000000012b2ca4 __do_softirq+0x334/0x680
  c0000000001dd878 irq_exit+0x148/0x1d0
  c000000000016f4c do_IRQ+0x20c/0x240
  c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0
[CAUSE] There is a very small race window like the following in generic/095.
          Thread 1                |           Thread 2
  --------------------------------+------------------------------------
    end_bio_extent_writepage()    | btrfs_releasepage()
    |- spin_lock_irqsave()        | |
    |- end_page_writeback()       | |
    |                             | |- if (PageWriteback() ||...)
    |                             | |- clear_page_extent_mapped()
    |                             |    |- kfree(subpage);
    |- spin_unlock_irqrestore().
The race can also happen between writeback and btrfs_invalidatepage(), although that would be much harder as btrfs_invalidatepage() has much more work to do before the clear_page_extent_mapped() call. [FIX] Here we "wait" for the subpage spinlock to be released before we detach the subpage structure. So this patch introduces a new function, wait_subpage_spinlock(), to do the "wait" by acquiring the spinlock and releasing it. Since the caller has ensured the page is neither dirty nor under writeback, and the page is already locked, the only way to hold the subpage spinlock is from the endio function. 
Thus we only need to acquire the spinlock to wait for any existing holder. Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:05 +02:00
Qu Wenruo	7367253a35	btrfs: subpage: disable inline extent creation [BUG] When running the following fsx command (extracted from generic/127) on subpage filesystem, it can create inline extent with regular extents: fsx -q -l 262144 -o 65536 -S 191110531 -N 9057 -R -W $mnt/file > /tmp/fsx The offending extent would look like: item 9 key (257 INODE_REF 256) itemoff 15703 itemsize 14 index 2 namelen 4 name: file item 10 key (257 EXTENT_DATA 0) itemoff 14975 itemsize 728 generation 7 type 0 (inline) inline extent data size 707 ram_bytes 707 compression 0 (none) item 11 key (257 EXTENT_DATA 4096) itemoff 14922 itemsize 53 generation 7 type 2 (prealloc) prealloc data disk byte 102346752 nr 4096 prealloc data offset 0 nr 4096 [CAUSE] For subpage filesystem, the writeback is triggered in page units, which means, even if we just want to writeback range [16K, 20K) for 64K page system, we will still try to writeback any dirty sector of range [0, 64K). This is never a problem if sectorsize == PAGE_SIZE, but for subpage, this can cause unexpected problems. For above test case, the last several operations from fsx are: 9055 trunc from 0x40000 to 0x2c3 9057 falloc from 0x164c to 0x19d2 (0x386 bytes) In operation 9055, we dirtied sector [0, 4096), then in falloc, we call btrfs_wait_ordered_range(inode, start=4096, len=4096), only expecting to writeback any dirty data in [4096, 8192), but nothing else. Unfortunately, in subpage case, above btrfs_wait_ordered_range() will trigger writeback of the range [0, 64K), which includes the data at [0, 4096). And since at the call site, we haven't yet increased i_size, which is still 707, this means cow_file_range() can insert an inline extent. Resulting above inline + regular extent. [WORKAROUND] I don't really have any good short-term solution yet, as this means all operations that would trigger writeback need to be reviewed for any i_size change. So here I choose to disable inline extent creation for subpage case as a workaround. 
We have done tons of work just to avoid such extents, so I don't want to create an exception just for subpage. This only affects inline extent creation; subpage has no problem reading existing inline extents at all. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:05 +02:00
Qu Wenruo	3670e6451b	btrfs: subpage: check if there are compressed extents inside one page [BUG] When testing experimental subpage compressed write support, it hits a NULL pointer dereference inside the read path: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018 pc : __pi_memcmp+0x28/0x1ec lr : check_data_csum+0xd0/0x274 [btrfs] Call trace: __pi_memcmp+0x28/0x1ec btrfs_verify_data_csum+0xf4/0x244 [btrfs] end_bio_extent_readpage+0x1d0/0x6b0 [btrfs] bio_endio+0x15c/0x1dc end_workqueue_fn+0x44/0x64 [btrfs] btrfs_work_helper+0x74/0x250 [btrfs] process_one_work+0x1d4/0x47c worker_thread+0x180/0x400 kthread+0x11c/0x120 ret_from_fork+0x10/0x30 Code: 54000261 d100044c d343fd8c f8408403 (f8408424) ---[ end trace 9e2c59f33ea40866 ]--- [CAUSE] When reading two compressed extents inside the same page, like the following layout, we trigger the above crash: 0 32K 64K \|-------\|\\\\\\\\| \| \- Compressed extent (A) \--------- Compressed extent (B) For compressed read, we don't need to populate its io_bio->csum, as we rely on compressed_bio->csum to verify the compressed data, and then copy the decompressed data to inode pages. Normally btrfs_verify_data_csum() skips such pages by checking and clearing the PageChecked flag. But since that flag is still for the full page, when endio for inode page range [0, 32K) gets executed, it clears the PageChecked flag for the full page. Then when endio for inode page range [32K, 64K) gets executed, since the page no longer has the PageChecked flag, it just continues checking, even though io_bio->csum is NULL. [FIX] Thankfully there are only two users of the PageChecked bit: - Cow fixup Since subpage has its own way to trace page dirty (dirty_bitmap) and ordered bit (ordered_bitmap), it should never trigger cow fixup. - Compressed read We can distinguish such read by just checking io_bio->csum. So just check io_bio->csum before doing the verification to avoid such NULL pointer dereference. 
Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:03 +02:00
David Sterba	f41b6ba93d	btrfs: remove uptodate parameter from btrfs_dec_test_first_ordered_pending Commit e65f152e4348 ("btrfs: refactor how we finish ordered extent io for endio functions") removed the last caller that did not pass 1 for the uptodate parameter. Only one caller remains and it passes 1, so we can remove the parameter and simplify the code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:02 +02:00
David Sterba	25c1252a02	btrfs: switch uptodate to bool in btrfs_writepage_endio_finish_ordered The uptodate parameter should be bool, change the type. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:02 +02:00
Qu Wenruo	a129ffb816	btrfs: remove unused start and end parameters from btrfs_run_delalloc_range() Since commit d75855b4518b ("btrfs: Remove extent_io_ops::writepage_start_hook") removed writepage_start_hook() and added the btrfs_writepage_cow_fixup() function, there is no need to follow the old hook parameters. Remove the @start and @end parameters: the fixup check currently covers the full page, so it doesn't need them. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:02 +02:00
Filipe Manana	cceaa89f02	btrfs: remove racy and unnecessary inode transaction update when using no-holes When using the NO_HOLES feature and expanding the size of an inode, we update the inode's last_trans, last_sub_trans and last_log_commit fields at maybe_insert_hole() so that a fsync does know that the inode needs to be logged (by making sure that btrfs_inode_in_log() returns false). This happens for expanding truncate operations, buffered writes, direct IO writes and when cloning extents to an offset greater than the inode's i_size. However the way we do it is racy, because in between setting the inode's last_sub_trans and last_log_commit fields, the log transaction ID that was assigned to last_sub_trans might be committed before we read the root's last_log_commit and assign that value to last_log_commit. If that happens it would make a future call to btrfs_inode_in_log() return true. This is a race that should be extremely unlikely to be hit in practice, and it is the same that was described by commit bc0939fcfab0d7 ("btrfs: fix race between marking inode needs to be logged and log syncing"). The fix would simply be to set last_log_commit to the value we assigned to last_sub_trans minus 1, like it was done in that commit. However updating these two fields plus the last_trans field is pointless here because all the callers of btrfs_cont_expand() (which is the only caller of maybe_insert_hole()) always call btrfs_set_inode_last_trans() or btrfs_update_inode() after calling btrfs_cont_expand(). Calling either btrfs_set_inode_last_trans() or btrfs_update_inode() guarantees that the next fsync will log the inode, as it makes btrfs_inode_in_log() return false. So just remove the code that explicitly sets the inode's last_trans, last_sub_trans and last_log_commit fields. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:01 +02:00
David Sterba	4c2bf276b5	btrfs: compression: drop kmap/kunmap from generic helpers The pages in compressed_pages are not from highmem anymore so we can drop the mapping for checksum calculation and inline extent. Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:00 +02:00
NeilBrown	3f79f6f624	btrfs: prevent rename2 from exchanging a subvol with a directory from different parents Cross-rename lacks a check when that would prevent exchanging a directory and subvolume from different parent subvolume. This causes data inconsistencies and is caught before commit by tree-checker, turning the filesystem to read-only. Calling the renameat2 with RENAME_EXCHANGE flags like renameat2(AT_FDCWD, namesrc, AT_FDCWD, namedest, (1 << 1)) on two paths: namesrc = dir1/subvol1/dir2 namedest = subvol2/subvol3 will cause key order problem with following write time tree-checker report: [1194842.307890] BTRFS critical (device loop1): corrupt leaf: root=5 block=27574272 slot=10 ino=258, invalid previous key objectid, have 257 expect 258 [1194842.322221] BTRFS info (device loop1): leaf 27574272 gen 8 total ptrs 11 free space 15444 owner 5 [1194842.331562] BTRFS info (device loop1): refs 2 lock_owner 0 current 26561 [1194842.338772] item 0 key (256 1 0) itemoff 16123 itemsize 160 [1194842.338793] inode generation 3 size 16 mode 40755 [1194842.338801] item 1 key (256 12 256) itemoff 16111 itemsize 12 [1194842.338809] item 2 key (256 84 2248503653) itemoff 16077 itemsize 34 [1194842.338817] dir oid 258 type 2 [1194842.338823] item 3 key (256 84 2363071922) itemoff 16043 itemsize 34 [1194842.338830] dir oid 257 type 2 [1194842.338836] item 4 key (256 96 2) itemoff 16009 itemsize 34 [1194842.338843] item 5 key (256 96 3) itemoff 15975 itemsize 34 [1194842.338852] item 6 key (257 1 0) itemoff 15815 itemsize 160 [1194842.338863] inode generation 6 size 8 mode 40755 [1194842.338869] item 7 key (257 12 256) itemoff 15801 itemsize 14 [1194842.338876] item 8 key (257 84 2505409169) itemoff 15767 itemsize 34 [1194842.338883] dir oid 256 type 2 [1194842.338888] item 9 key (257 96 2) itemoff 15733 itemsize 34 [1194842.338895] item 10 key (258 12 256) itemoff 15719 itemsize 14 [1194842.339163] BTRFS error (device loop1): block=27574272 write time tree block corruption 
detected [1194842.339245] ------------[ cut here ]------------ [1194842.443422] WARNING: CPU: 6 PID: 26561 at fs/btrfs/disk-io.c:449 csum_one_extent_buffer+0xed/0x100 [btrfs] [1194842.511863] CPU: 6 PID: 26561 Comm: kworker/u17:2 Not tainted 5.14.0-rc3-git+ #793 [1194842.511870] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008 [1194842.511876] Workqueue: btrfs-worker-high btrfs_work_helper [btrfs] [1194842.511976] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs] [1194842.512068] RSP: 0018:ffffa2c284d77da0 EFLAGS: 00010282 [1194842.512074] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffff928867bd9978 [1194842.512078] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff928867bd9970 [1194842.512081] RBP: ffff92876b958000 R08: 0000000000000001 R09: 00000000000c0003 [1194842.512085] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 [1194842.512088] R13: ffff92875f989f98 R14: 0000000000000000 R15: 0000000000000000 [1194842.512092] FS: 0000000000000000(0000) GS:ffff928867a00000(0000) knlGS:0000000000000000 [1194842.512095] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1194842.512099] CR2: 000055f5384da1f0 CR3: 0000000102fe4000 CR4: 00000000000006e0 [1194842.512103] Call Trace: [1194842.512128] ? run_one_async_free+0x10/0x10 [btrfs] [1194842.631729] btree_csum_one_bio+0x1ac/0x1d0 [btrfs] [1194842.631837] run_one_async_start+0x18/0x30 [btrfs] [1194842.631938] btrfs_work_helper+0xd5/0x1d0 [btrfs] [1194842.647482] process_one_work+0x262/0x5e0 [1194842.647520] worker_thread+0x4c/0x320 [1194842.655935] ? process_one_work+0x5e0/0x5e0 [1194842.655946] kthread+0x135/0x160 [1194842.655953] ? 
set_kthread_struct+0x40/0x40 [1194842.655965] ret_from_fork+0x1f/0x30 [1194842.672465] irq event stamp: 1729 [1194842.672469] hardirqs last enabled at (1735): [<ffffffffbd1104f5>] console_trylock_spinning+0x185/0x1a0 [1194842.672477] hardirqs last disabled at (1740): [<ffffffffbd1104cc>] console_trylock_spinning+0x15c/0x1a0 [1194842.672482] softirqs last enabled at (1666): [<ffffffffbdc002e1>] __do_softirq+0x2e1/0x50a [1194842.672491] softirqs last disabled at (1651): [<ffffffffbd08aab7>] __irq_exit_rcu+0xa7/0xd0 The corrupted data will not be written, and filesystem can be unmounted and mounted again (all changes since the last commit will be lost). Add the missing check for new_ino so that all non-subvolumes must reside under the same parent subvolume. There's an exception allowing to exchange two subvolumes from any parents as the directory representing a subvolume is only a logical link and does not have any other structures related to the parent subvolume, unlike files, directories etc, that are always in the inode namespace of the parent subvolume. Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: stable@vger.kernel.org # 4.7+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-16 13:33:23 +02:00
Christoph Hellwig	c7c3a6dcb1	btrfs: store a block_device in struct btrfs_ordered_extent Store the block device instead of the gendisk in the btrfs_ordered_extent structure instead of acquiring a reference to it later. Note: this is from series removing bdgrab/bdput, btrfs is one of the last users. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-07-22 15:50:15 +02:00
Naohiro Aota	abb99cfdaf	btrfs: properly split extent_map for REQ_OP_ZONE_APPEND Damien reported a test failure with btrfs/209. The test itself ran fine, but the fsck ran afterwards reported a corrupted filesystem. The filesystem corruption happens because we're splitting an extent and then writing the extent twice. We have to split the extent though, because we're creating too large extents for a REQ_OP_ZONE_APPEND operation. When dumping the extent tree, we can see two EXTENT_ITEMs at the same start address but different lengths. $ btrfs inspect dump-tree /dev/nullb1 -t extent ... item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53 refs 1 gen 7 flags DATA extent data backref root FS_TREE objectid 257 offset 786432 count 1 item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53 refs 1 gen 7 flags DATA extent data backref root FS_TREE objectid 257 offset 786432 count 1 The duplicated EXTENT_ITEMs originally come from wrongly split extent_map in extract_ordered_extent(). Since extract_ordered_extent() uses create_io_em() to split an existing extent_map, we will have split->orig_start != split->start. Then, it will be logged with non-zero "extent data offset". Finally, the logged entries are replayed into a duplicated EXTENT_ITEM. Introduce and use proper splitting function for extent_map. The function is intended to be simple and specific usage for extract_ordered_extent() e.g. not supporting compression case (we do not allow splitting compressed extent_map anyway). There was a question raised by Qu, in summary why we want to split the extent map (and not the bio): The problem is not the limit on the zone end, which as you mention is the same as the block group end. The problem is that data write use zone append (ZA) operations. 
ZA BIOs cannot be split so a large extent may need to be processed with multiple ZA BIOs. While that is also true for regular writes, the major difference is that ZA are "nameless" write operations giving back the written sectors on completion. And ZA operations may be reordered by the block layer (not intentionally though). Combine both of these characteristics and you can see that the data for a large extent may end up being shuffled when written resulting in data corruption and the impossibility to map the extent to some start sector. To avoid this problem, zoned btrfs uses the principle "one data extent == one ZA BIO". So large extents need to be split. This is unfortunate, but we can revisit this later and optimize, e.g. merge back together the fragments of an extent once written if they actually were written sequentially in the zone. Reported-by: Damien Le Moal <damien.lemoal@wdc.com> Fixes: d22002fd37bd ("btrfs: zoned: split ordered extent when bio is sent") CC: stable@vger.kernel.org # 5.12+ CC: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-07-07 17:42:45 +02:00
David Sterba	f216562731	btrfs: compression: don't try to compress if we don't have enough pages The early check if we should attempt compression does not take into account the number of input pages. It can happen that there's only one page, e.g. a tail page after some ranges of the BTRFS_MAX_UNCOMPRESSED have been processed, or an isolated page that won't be converted to an inline extent. The single page would be compressed but a later check would drop it again because the result size must be at least one block shorter than the input. That can never work with just one page. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-22 14:11:57 +02:00
David Sterba	1a9fd4172d	btrfs: fix typos in comments Fix typos that have snuck in since the last round. Found by codespell. Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-22 14:11:57 +02:00
Qu Wenruo	bcd77455d5	btrfs: don't clear page extent mapped if we're not invalidating the full page [BUG] With current btrfs subpage rw support, the following script can lead to fs hang: $ mkfs.btrfs -f -s 4k $dev $ mount $dev -o nospace_cache $mnt $ fsstress -w -n 100 -p 1 -s 1608140256 -v -d $mnt The fs will hang at btrfs_start_ordered_extent(). [CAUSE] In above test case, btrfs_invalidatepage() will be called with the following parameters: offset = 0 length = 53248 page dirty = 1 subpage dirty bitmap = 0x2000 Since @offset is 0, btrfs_invalidatepage() will try to invalidate the full page, and finally call clear_page_extent_mapped() which will detach subpage structure from the page. And since the page no longer has subpage structure, the subpage dirty bitmap will be cleared, preventing the dirty range from being written back, so there is no way to wake up the ordered extent. [FIX] Just follow other filesystems and only invalidate the page if the range covers the full page. There are cases like truncate_setsize() which can call btrfs_invalidatepage() with offset == 0 and length != 0 for the last page of an inode. Although the old code will still try to invalidate the full page, we are still safe to just wait for ordered extent to finish. So it shouldn't cause extra problems. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:10 +02:00
Qu Wenruo	2d8ec40ee4	btrfs: make btrfs_page_mkwrite() to be subpage compatible Only set_page_dirty() and SetPageUptodate() are not subpage compatible. Convert them to subpage helpers, so that __extent_writepage_io() can submit page content correctly. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:10 +02:00
Qu Wenruo	6c9ac8be45	btrfs: make btrfs_truncate_block() to be subpage compatible btrfs_truncate_block() itself is already mostly subpage compatible, the only missing part is the page dirtying code. Currently if we have a sector that needs to be truncated, we set the sector aligned range delalloc, then set the full page dirty. The problem is, current subpage code requires subpage dirty bit to be set, or __extent_writepage_io() won't submit bio, so the ordered extent never finishes. So this patch will make btrfs_truncate_block() call the btrfs_page_set_dirty() helper instead of set_page_dirty() to fix the problem. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:10 +02:00
Qu Wenruo	d2a9106448	btrfs: make btrfs_set_range_writeback() subpage compatible Function btrfs_set_range_writeback() currently just sets the page writeback unconditionally. Change it to call the subpage helper so that we can handle both cases well. Since the subpage helpers need btrfs_fs_info, also change the parameter to accept btrfs_inode. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:10 +02:00
Qu Wenruo	4750af3bbe	btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig() In cow_file_range(), after we have succeeded creating an inline extent, we unlock the page with extent_clear_unlock_delalloc() by passing locked_page == NULL. For sectorsize == PAGE_SIZE case, this is just making the page lock and unlock harder to grab. But for incoming subpage case, it can be a big problem. For incoming subpage case, page locking has two entry points: - __process_pages_contig() In that case, we know exactly the range we want to lock (which only requires sector alignment). To handle the subpage requirement, we introduce btrfs_subpage::writers to page::private, and will update it in __process_pages_contig(). - Other directly lock/unlock_page() call sites Those won't touch btrfs_subpage::writers at all. This means a page locked by __process_pages_contig() can only be unlocked by __process_pages_contig(). Thankfully we already have the existing infrastructure in the form of @locked_page in various call sites. Unfortunately, extent_clear_unlock_delalloc() in cow_file_range() after creating an inline extent is the exception. It intentionally calls extent_clear_unlock_delalloc() with locked_page == NULL, to also unlock current page (and clear its dirty/writeback bits). To co-operate with incoming subpage modifications, and make the page lock/unlock pair easier to understand, this patch will still call extent_clear_unlock_delalloc() with locked_page, and only unlock the page in __extent_writepage(). Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:10 +02:00
Qu Wenruo	b945a4637e	btrfs: make page Ordered bit to be subpage compatible This involves the following modifications: - Ordered extent creation This is done in process_one_page(), now PAGE_SET_ORDERED will call subpage helper to do the work. - endio functions This is done in btrfs_mark_ordered_io_finished(). - btrfs_invalidatepage() - btrfs_cleanup_ordered_extents() Use the subpage page helper, and add an extra branch to exit if the locked page has covered the full range. Now the usage of page Ordered flag for ordered extent accounting is fully subpage compatible. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:10 +02:00
Qu Wenruo	968f2566ad	btrfs: fix hang when run_delalloc_range() failed [BUG] When running subpage preparation patches on x86, btrfs/125 will hang forever with one ordered extent never finished. [CAUSE] The test case btrfs/125 itself will always fail as the fix is never merged. When the test fails at balance, btrfs needs to cleanup the ordered extent in btrfs_cleanup_ordered_extents() for data reloc inode. The problem is in the sequence in which we clean up the page Ordered bit. Currently it works like: btrfs_cleanup_ordered_extents() \|- find_get_page(); \|- btrfs_page_clear_ordered(page); \| Now the page doesn't have Ordered bit anymore. \| !!! This also includes the first (locked) page !!! \| \|- offset += PAGE_SIZE \| This is to skip the first page \|- __endio_write_update_ordered() \|- btrfs_mark_ordered_io_finished(NULL) Except for the first page, all ordered extents are finished. Then the locked page is cleaned up in __extent_writepage(): __extent_writepage() \|- If (PageError(page)) \|- end_extent_writepage() \|- btrfs_mark_ordered_io_finished(page) \|- if (btrfs_test_page_ordered(page)) \|- !!! The page gets skipped !!! The ordered extent is not decreased as the page doesn't have ordered bit anymore. This leaves the ordered extent with bytes_left == sectorsize, so it never finishes. [FIX] The fix is to ensure we never clear page Ordered bit without running the ordered extent accounting. Here we choose to skip the locked page in btrfs_cleanup_ordered_extents() so that later end_extent_writepage() can properly finish the ordered extent. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:09 +02:00
Qu Wenruo	f57ad93735	btrfs: rename PagePrivate2 to PageOrdered inside btrfs Inside btrfs we use Private2 page status to indicate we have an ordered extent with pending IO for the sector. But the page status name, Private2, tells us nothing about the bit itself, so this patch will rename it to Ordered. And with extra comment about the bit added, so reader who is still uncertain about the page Ordered status, will find the comment pretty easily. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:09 +02:00
Qu Wenruo	3b8358407a	btrfs: refactor btrfs_invalidatepage() for subpage support This patch will refactor btrfs_invalidatepage() for the incoming subpage support. The involved modifications are: - Use while() loop instead of "goto again;" - Use single variable to determine whether to delete extent states Each branch will also have comments why we can or cannot delete the extent states - Do qgroup free and extent states deletion per-loop Current code can only work for PAGE_SIZE == sectorsize case. This refactor also makes it clear what we do for different sectors: - Sectors without ordered extent We're completely safe to remove all extent states for the sector(s) - Sectors with ordered extent, but no Private2 bit This means the endio has already been executed, we can't remove all extent states for the sector(s). - Sectors with ordered extent, still has Private2 bit This means we need to decrease the ordered extent accounting. And then it comes to two different variants: * We have finished and removed the ordered extent Then it's the same as "sectors without ordered extent" * We didn't finish the ordered extent We can remove some extent states, but not all. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:08 +02:00
Qu Wenruo	266a258678	btrfs: update comments in btrfs_invalidatepage() The existing comments in btrfs_invalidatepage() don't really get to the point, especially for what Private2 is really representing and how the race avoidance is done. The truth is, there are only three entrances to do ordered extent accounting: - btrfs_writepage_endio_finish_ordered() - __endio_write_update_ordered() Those two entrances are just endio functions for dio and buffered write. - btrfs_invalidatepage() But there is a pitfall, in endio functions there is no check on whether the ordered extent is already accounted. They just blindly clear the Private2 bit and do the accounting. So it's all btrfs_invalidatepage()'s responsibility to make sure we won't do double account for the same sector. That's why in btrfs_invalidatepage() we have to wait for page writeback, this will ensure all submitted bios have finished, thus their endio functions have finished the accounting on the ordered extent. Then we also check page Private2 to ensure that we only run ordered extent accounting on pages that have no bio submitted. This patch will rework related comments to make it more clear on the race and how we use wait_on_page_writeback() and Private2 to prevent double accounting on ordered extent. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:08 +02:00
Qu Wenruo	e65f152e43	btrfs: refactor how we finish ordered extent io for endio functions Btrfs has two endio functions to mark certain io range finished for ordered extents: - __endio_write_update_ordered() This is for direct IO - btrfs_writepage_endio_finish_ordered() This is for buffered IO. However they take different routes to handle ordered extent io: - Whether to iterate through all ordered extents __endio_write_update_ordered() will but btrfs_writepage_endio_finish_ordered() will not. In fact, iterating through all ordered extents will benefit later subpage support, while for current PAGE_SIZE == sectorsize requirement this behavior makes no difference. - Whether to update page Private2 flag __endio_write_update_ordered() will not update page Private2 flag as for iomap direct IO, the page can not be even mapped. While btrfs_writepage_endio_finish_ordered() will clear Private2 to prevent double accounting against btrfs_invalidatepage(). Those differences are pretty subtle, and the ordered extent iterations code in callers makes code much harder to read. So this patch will introduce a new function, btrfs_mark_ordered_io_finished(), to do the heavy lifting: - Iterate through all ordered extents in the range - Do the ordered extent accounting - Queue the work for finished ordered extent This function has two new features: - Proper underflow detection and recovery The old underflow detection will only detect the problem, then continue. No proper info like root/inode/ordered extent info, nor noisy enough to be caught by fstests. Furthermore when underflow happens, the ordered extent will never finish. New error detection will reset the bytes_left to 0, do proper kernel warning, and output extra info including root, ino, ordered extent range, the underflow value. - Prevent double accounting based on Private2 flag Now if we find a range without Private2 flag, we will skip to next range. As that means someone else has already finished the accounting of ordered extent. 
This makes no difference for current code, but will be a critical part for incoming subpage support, as we can call btrfs_mark_ordered_io_finished() for multiple sectors if they are beyond inode size. Thus such double accounting prevention is a key feature for subpage. Now both endio functions only need to call that new function. And since the only caller of btrfs_dec_test_first_ordered_pending() is removed, also remove btrfs_dec_test_first_ordered_pending() completely. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:08 +02:00
Qu Wenruo	87b4d86baa	btrfs: make Private2 lifespan more consistent Currently we use page Private2 bit to indicate that we have ordered extent for the page range. But the lifespan of it is not consistent, during regular writeback path, there are two locations to clear the same PagePrivate2: T ----- Page marked Dirty \| + ----- Page marked Private2, through btrfs_run_delalloc_range() \| + ----- Page cleared Private2, through btrfs_writepage_cow_fixup() \| in __extent_writepage_io() \| ^^^ Private2 cleared for the first time \| + ----- Page marked Writeback, through btrfs_set_range_writeback() \| in __extent_writepage_io(). \| + ----- Page cleared Private2, through \| btrfs_writepage_endio_finish_ordered() \| ^^^ Private2 cleared for the second time. \| + ----- Page cleared Writeback, through btrfs_writepage_endio_finish_ordered() Currently PagePrivate2 is mostly to prevent ordered extent accounting being executed for both endio and invalidatepage. Thus only the one who cleared page Private2 is responsible for ordered extent accounting. But the fact is, in btrfs_writepage_endio_finish_ordered(), page Private2 is cleared and ordered extent accounting is executed unconditionally. The race prevention only happens through btrfs_invalidatepage(), where we wait for the page writeback first, before checking the Private2 bit. This means, Private2 is also protected by Writeback bit, and there is no need for btrfs_writepage_cow_fixup() to clear Private2. This patch will change btrfs_writepage_cow_fixup() to just check PagePrivate2, not to clear it. The clearing will happen in either btrfs_invalidatepage() or btrfs_writepage_endio_finish_ordered(). This makes the Private2 bit easier to understand, just meaning the page has unfinished ordered extent attached to it. And this patch is a hard requirement for the incoming refactoring for how we finished ordered IO for endio context, as the coming patch will check Private2 to determine if we need to do the ordered extent accounting. 
Thus this patch is definitely needed or we will hang due to unfinished ordered extent. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:08 +02:00
Qu Wenruo	38a39ac77e	btrfs: pass btrfs_inode to btrfs_writepage_endio_finish_ordered() There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in end_compressed_bio_write(). It passes compressed pages to btrfs_writepage_endio_finish_ordered(), which is only supposed to accept inode pages. Thankfully the important info here is the inode, so let's pass btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and make @page parameter optional. By this, end_compressed_bio_write() can happily pass page=NULL while still getting everything done properly. Also, to cooperate with such modification, replace @page parameter for trace_btrfs_writepage_end_io_hook() with btrfs_inode. Although this removes page_index info, the existing start/len should be enough for most usage. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:08 +02:00
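The accounting rules described in commit e65f152e43 above (skip sectors whose Private2/Ordered flag is already cleared, and reset bytes_left to 0 on underflow instead of letting it wrap) can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel implementation: struct toy_ordered_extent, toy_mark_ordered_io_finished(), the 32-bit per-page bitmap and the fixed 4096-byte SECTORSIZE are all hypothetical names and simplifications introduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SECTORSIZE 4096u /* assumed sector size, for illustration only */

/* Hypothetical stand-in for an ordered extent: only the accounting field. */
struct toy_ordered_extent {
	uint64_t bytes_left;
};

/*
 * Finish the accounting for sectors [first, first + nr) of one page.
 * Bit i of *ordered_bitmap plays the role of the Ordered (Private2) flag
 * for sector i. Returns true once the whole extent has been accounted.
 */
static bool toy_mark_ordered_io_finished(struct toy_ordered_extent *oe,
					  uint32_t *ordered_bitmap,
					  unsigned int first, unsigned int nr)
{
	assert(first + nr <= 32);

	for (unsigned int i = first; i < first + nr; i++) {
		uint32_t bit = UINT32_C(1) << i;

		/* Flag already cleared: someone else did the accounting. */
		if (!(*ordered_bitmap & bit))
			continue;
		*ordered_bitmap &= ~bit;

		if (oe->bytes_left < SECTORSIZE) {
			/* Underflow: warn and reset instead of wrapping. */
			fprintf(stderr, "underflow: bytes_left=%llu\n",
				(unsigned long long)oe->bytes_left);
			oe->bytes_left = 0;
		} else {
			oe->bytes_left -= SECTORSIZE;
		}
	}
	return oe->bytes_left == 0;
}
```

Calling the helper twice for the same sector decrements bytes_left only once, because the second call finds the flag already cleared. That is the property the commit message relies on when both an endio function and btrfs_invalidatepage() may reach the same range.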


2062 Commits