78354 Commits

Author SHA1 Message Date
Jason Yan
43bd6f1b49 ext4: goto right label 'failed_mount3a'
Before these two branches, the journal has not been loaded and the xattr
cache has not been created, so the right label to goto is
'failed_mount3a'. This did not cause any issues because the error handler
checks whether the pointer is NULL. However, it still confused me when
reading the code, so it is worth changing the goto to the right label.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220916141527.1012715-2-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Ye Bin
e64e6ca909 ext4: adjust fast commit disable judgement order in ext4_fc_track_inode
If fast commit is already disabled, there is no need to mark the inode
ineligible. So move the 'ext4_fc_disabled()' check before the
'ext4_should_journal_data(inode)' check, which avoids a meaningless check.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220916083836.388347-3-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Ye Bin
b7b80a35fb ext4: factor out ext4_fc_disabled()
Factor out ext4_fc_disabled(). No functional change.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220916083836.388347-2-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Ye Bin
ccbf8eeb39 ext4: fix miss release buffer head in ext4_fc_write_inode
The 'ext4_fc_write_inode' function first calls 'ext4_get_inode_loc' to get
'iloc', but fails to release 'iloc.bh' after using it.
So release 'iloc.bh' before 'ext4_fc_write_inode' returns.

Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220914100859.1415196-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Ye Bin
243d1a5d50 jbd2: fix potential use-after-free in jbd2_fc_wait_bufs
'jbd2_fc_wait_bufs' uses 'bh' after dropping the buffer head reference
count, which may lead to a use-after-free.
So check whether the buffer is uptodate before dropping the buffer head
reference count.
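
The ordering problem described above can be sketched in user space. This is a toy model, not the jbd2 code: 'struct toy_buf' and the 'toy_*' helpers are hypothetical names standing in for a reference-counted buffer head.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for a reference-counted buffer head (hypothetical type). */
struct toy_buf {
	int refcount;
	bool uptodate;
};

struct toy_buf *toy_buf_new(bool uptodate)
{
	struct toy_buf *b = malloc(sizeof(*b));

	assert(b != NULL);
	b->refcount = 1;
	b->uptodate = uptodate;
	return b;
}

/* Drop one reference; the buffer is freed when the last one goes away. */
void toy_buf_put(struct toy_buf *b)
{
	assert(b->refcount > 0);
	if (--b->refcount == 0)
		free(b);
}

/* Read the state first, then drop the reference. */
int toy_wait_buf(struct toy_buf *b)
{
	bool ok = b->uptodate;	/* still safe: we hold a reference here */

	toy_buf_put(b);		/* b must not be dereferenced after this */
	return ok ? 0 : -EIO;
}
```

Reading 'b->uptodate' after 'toy_buf_put()' would touch memory that may already have been freed, which is the pattern the fix avoids.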

Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220914100812.1414768-3-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Ye Bin
e0d5fc7a6d jbd2: fix potential buffer head reference count leak
In 'jbd2_fc_wait_bufs', if a buffer isn't uptodate, the function returns
-EIO without updating 'journal->j_fc_off'. But 'jbd2_fc_release_bufs'
releases buffer heads starting from 'j_fc_off - 1' and stops as soon as it
finds a NULL 'bh', which leads to a buffer head reference count leak.
To solve the above issue, update 'journal->j_fc_off' before returning -EIO.
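
A toy model of the leak, under stated assumptions: the 'toy_journal' structure, the slot array and the 'update_off' switch are all hypothetical, and the exact value the real fix stores in 'j_fc_off' is not given in this message. The sketch only shows why the release loop stops early when the offset is left stale.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_SLOTS 8

/* Hypothetical model of the fast-commit buffer array. */
struct toy_journal {
	int off;			/* one past the last used slot */
	int held[TOY_SLOTS];		/* 1 while a reference is held */
	bool uptodate[TOY_SLOTS];
};

/* Wait on the top 'num' slots, dropping each reference as we go. */
int toy_wait_bufs(struct toy_journal *j, int num, bool update_off)
{
	int off = j->off;

	assert(num >= 0 && num <= off && off <= TOY_SLOTS);
	for (int i = off - 1; i >= off - num; i--) {
		j->held[i] = 0;		/* reference on slot i is dropped */
		if (!j->uptodate[i]) {
			if (update_off)
				j->off = i;	/* slots below i are still held */
			return -EIO;
		}
	}
	return 0;
}

/*
 * Release from off - 1 downwards, stopping at the first slot that is
 * already released. Returns how many slots are still held (leaked).
 */
int toy_release_bufs(struct toy_journal *j)
{
	int leaked = 0;

	for (int i = j->off - 1; i >= 0; i--) {
		if (!j->held[i])
			break;
		j->held[i] = 0;
	}
	for (int i = 0; i < TOY_SLOTS; i++)
		leaked += j->held[i];
	return leaked;
}
```

With a stale offset the release loop starts at a slot that was already cleared and stops immediately, so the lower slots keep their references. With the offset updated, the loop starts below the failed slot and releases everything.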

Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220914100812.1414768-2-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Zhihao Cheng
7177dd009c ext4: fix dir corruption when ext4_dx_add_entry() fails
The following process may lead to fs corruption:
1. ext4_create(dir/foo)
 ext4_add_nondir
  ext4_add_entry
   ext4_dx_add_entry
     a. add_dirent_to_buf
      ext4_mark_inode_dirty
      ext4_handle_dirty_metadata   // dir inode bh is recorded into journal
     b. ext4_append    // dx_get_count(entries) == dx_get_limit(entries)
       ext4_bread(EXT4_GET_BLOCKS_CREATE)
        ext4_getblk
         ext4_map_blocks
          ext4_ext_map_blocks
            ext4_mb_new_blocks
             dquot_alloc_block
              dquot_alloc_space_nodirty
               inode_add_bytes    // update dir's i_blocks
            ext4_ext_insert_extent
	     ext4_ext_dirty  // record extent bh into journal
              ext4_handle_dirty_metadata(bh)
	      // record new block into journal
       inode->i_size += inode->i_sb->s_blocksize   // new size(in mem)
     c. ext4_handle_dirty_dx_node(bh2)
	// record dir's new block(dx_node) into journal
     d. ext4_handle_dirty_dx_node((frame - 1)->bh)
     e. ext4_handle_dirty_dx_node(frame->bh)
     f. do_split    // ret err!
     g. add_dirent_to_buf
	 ext4_mark_inode_dirty(dir)  // update raw_inode on disk(skipped)
2. fsck -a /dev/sdb
 drops the last block (dx_node), which lies beyond the dir's i_size.
  /dev/sdb: recovering journal
  /dev/sdb contains a file system with errors, check forced.
  /dev/sdb: Inode 12, end of extent exceeds allowed value
	(logical block 128, physical block 3938, len 1)
3. fsck -fn /dev/sdb
 dx_node->entry[i].blk > dir->i_size
  Pass 2: Checking directory structure
  Problem in HTREE directory inode 12 (/dir): bad block number 128.
  Clear HTree index? no
  Problem in HTREE directory inode 12: block #3 has invalid depth (2)
  Problem in HTREE directory inode 12: block #3 has bad max hash
  Problem in HTREE directory inode 12: block #3 not referenced

Fix it by marking the inode dirty directly inside ext4_append().
A reproducer is available at [Link].

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216466
Cc: stable@vger.kernel.org
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220911045204.516460-1-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Gaosheng Cui
ebd5d23e88 ext4: remove ext4_inline_data_fiemap() declaration
ext4_inline_data_fiemap() has been removed since
commit d3b6f23f7167 ("ext4: move ext4_fiemap to use iomap framework"),
so remove it.

Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220909065307.1155201-1-cuigaosheng1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Jeff Layton
a642c2c082 ext4: fix i_version handling in ext4
ext4 currently updates the i_version counter when the atime is updated
during a read. This is less than ideal as it can cause unnecessary cache
invalidations with NFSv4 and unnecessary remeasurements for IMA.

The increment in ext4_mark_iloc_dirty is also problematic since it can
corrupt the i_version counter for ea_inodes. We aren't bumping the file
times in ext4_mark_iloc_dirty, so changing the i_version there seems
wrong, and is the cause of both problems.

Remove that callsite and add increments to the setattr, setxattr and
ioctl codepaths, at the same times that we update the ctime. The
i_version bump that already happens during timestamp updates should take
care of the rest.

In ext4_move_extents, increment the i_version on both inodes, and also
add in missing ctime updates.

[ Some minor updates since we've already enabled the i_version counter
  unconditionally via another patch series. -- TYT ]

Cc: stable@kernel.org
Cc: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20220908172448.208585-3-jlayton@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Jan Kara
5fc4cbd9fd mbcache: Avoid nesting of cache->c_list_lock under bit locks
Commit 307af6c87937 ("mbcache: automatically delete entries from cache
on freeing") started nesting cache->c_list_lock under the bit locks
protecting hash buckets of the mbcache hash table in
mb_cache_entry_create(). This causes problems for real-time kernels
because there spinlocks are sleeping locks while bitlocks stay atomic.
Luckily the nesting is easy to avoid by holding entry reference until
the entry is added to the LRU list. This makes sure we cannot race with
entry deletion.

Cc: stable@kernel.org
Fixes: 307af6c87937 ("mbcache: automatically delete entries from cache on freeing")
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220908091032.10513-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Andrew Perepechko
34fc8768ec jbd2: wake up journal waiters in FIFO order, not LIFO
LIFO wakeup order is unfair and sometimes leads to a journal
user not being able to get a journal handle for hundreds of
transactions in a row.

FIFO wakeup can make things more fair.
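
The fairness difference can be sketched with a toy waiter queue. This does not use or reproduce the kernel's wait queue API; 'toy_waitq' and its helpers are hypothetical. The waker always takes the entry at the head, so the only difference is whether new waiters are queued at the head (LIFO) or at the tail (FIFO).

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_WAITERS 16

/* Hypothetical waiter queue; ids[0] is the head that gets woken first. */
struct toy_waitq {
	int ids[TOY_MAX_WAITERS];
	int count;
};

/* LIFO behaviour: the newest waiter goes to the head of the queue. */
void toy_add_head(struct toy_waitq *q, int id)
{
	assert(q->count < TOY_MAX_WAITERS);
	memmove(&q->ids[1], &q->ids[0], q->count * sizeof(q->ids[0]));
	q->ids[0] = id;
	q->count++;
}

/* FIFO behaviour: the newest waiter goes to the tail of the queue. */
void toy_add_tail(struct toy_waitq *q, int id)
{
	assert(q->count < TOY_MAX_WAITERS);
	q->ids[q->count++] = id;
}

/* Wake the waiter at the head; returns its id, or -1 if the queue is empty. */
int toy_wake_one(struct toy_waitq *q)
{
	int id;

	if (q->count == 0)
		return -1;
	id = q->ids[0];
	q->count--;
	memmove(&q->ids[0], &q->ids[1], q->count * sizeof(q->ids[0]));
	return id;
}
```

With head insertion, a waiter that arrived early can be pushed back repeatedly by newer arrivals, which matches the starvation described above. With tail insertion, waiters are woken in arrival order.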

Cc: stable@kernel.org
Signed-off-by: Alexey Lyashkov <alexey.lyashkov@gmail.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220907165959.1137482-1-alexey.lyashkov@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Jinke Han
d1052d236e ext4: place buffer head allocation before handle start
In our production environment, we encountered jbd hangs waiting for handles
to stop while several writers were doing memory reclaim for buffer head
allocation in the delayed allocation write path. Ext4 allocates buffer heads
while holding a transaction handle, which may be blocked for too long if
reclaim does not go smoothly. According to our bcc trace, the reclaim time
in buffer head allocation can reach 258s, and the jbd transaction commit
takes almost the same time. Apart from these extreme cases, we often see
delays of several seconds for cgroup memory reclaim on our servers. This is
more likely to happen in docker environments.

One thing to note: buffer heads are allocated as often as pages, or more
often when the block size is less than the page size. Just as with page
cache allocation, we should place the buffer head allocation before
starting the handle.

Cc: stable@kernel.org
Signed-off-by: Jinke Han <hanjinke.666@bytedance.com>
Link: https://lore.kernel.org/r/20220903012429.22555-1-hanjinke.666@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Zhang Yi
0b73284c56 ext4: ext4_read_bh_lock() should submit IO if the buffer isn't uptodate
Recently we noticed that the ext4 filesystem would occasionally fail to
read metadata from disk and report an error message, even though the disk
and block layer looked fine. After analysis, we tracked it down to commit
88dbcbb3a484 ("blkdev: avoid migration stalls for blkdev pages"). It
provides a migration method for the bdev, so we can now move a page that
has buffers without extra users, but it locks the buffers on the page,
which breaks the fragile metadata read operation on the ext4 filesystem.
ext4_read_bh_lock() was copied from ll_rw_block() and depends on the
assumption that a locked buffer is under IO. So it just trylocks the
buffer and skips submitting IO if the lock fails; after wait_on_buffer()
we conclude there was an IO error because the buffer is not uptodate.

This issue can be easily reproduced by adding some delay just after
buffer_migrate_lock_buffers() in __buffer_migrate_folio() and running
fsstress on an ext4 filesystem.

  EXT4-fs error (device pmem1): __ext4_find_entry:1658: inode #73193:
  comm fsstress: reading directory lblock 0
  EXT4-fs error (device pmem1): __ext4_find_entry:1658: inode #75334:
  comm fsstress: reading directory lblock 0

Fix it by removing the trylock logic in ext4_read_bh_lock(): just lock
the buffer and submit IO if it's not uptodate. Also leave the readahead
helper alone.

Cc: stable@kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220831074629.3755110-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:51 -04:00
Jeff Layton
1ff2030739 ext4: unconditionally enable the i_version counter
The original i_version implementation was pretty expensive, requiring a
log flush on every change. Because of this, it was gated behind a mount
option (implemented via the MS_I_VERSION mount option flag).

Commit ae5e165d855d (fs: new API for handling inode->i_version) made the
i_version flag much less expensive, so there is no longer a performance
penalty from enabling it. xfs and btrfs already enable it
unconditionally when the on-disk format can support it.

Have ext4 ignore the SB_I_VERSION flag, and just enable it
unconditionally.  While we're in here, mark the i_version mount
option Opt_removed.

[ Removed leftover bits of i_version from ext4_apply_options() since it
  now can't ever be set in ctx->mask_s_flags -- lczerner ]

Cc: stable@kernel.org
Cc: Dave Chinner <david@fromorbit.com>
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220824160349.39664-3-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:45:00 -04:00
Linus Torvalds
d649d2c49b pstore revert for v6.0-rc8
- Revert crypto acomp migration (Guilherme G. Piccoli)

Merge tag 'pstore-v6.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore revert from Kees Cook:
 "A misbehavior with some compression backends in pstore was just
  discovered due to the recent crypto acomp migration.

  Since we're so close to release, it seems better to just simply revert
  it, and we can figure out what's going on without leaving it broken
  for a release.

   - Revert crypto acomp migration (Guilherme G. Piccoli)"

* tag 'pstore-v6.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  Revert "pstore: migrate to crypto acomp interface"
2022-09-30 08:54:14 -07:00
Guilherme G. Piccoli
40158dbf7e Revert "pstore: migrate to crypto acomp interface"
This reverts commit e4f0a7ec586b7644107839f5394fb685cf1aadcc.

When using this new interface, both efi_pstore and ramoops
backends are unable to properly decompress dmesg if using
zstd, lz4 and lzo algorithms (and maybe more). It does succeed
with deflate though.

The message observed in the kernel log is:

[2.328828] pstore: crypto_acomp_decompress failed, ret = -22!

The pstore infrastructure is able to collect the dmesg with
both backends tested, but since decompression fails it's
unreadable. With this revert everything is back to normal.

Fixes: e4f0a7ec586b ("pstore: migrate to crypto acomp interface")
Cc: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220929215515.276486-1-gpiccoli@igalia.com
2022-09-30 08:16:06 -07:00
Lukas Czerner
cbfecb927f fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE
Currently I_DIRTY_TIME will never get set if the inode already has
I_DIRTY_INODE, on the assumption that it supersedes I_DIRTY_TIME.  That's
true, however ext4 will only update the on-disk inode in
->dirty_inode(), not on actual writeback. As a result if the inode
already has I_DIRTY_INODE state by the time we get to
__mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
into on-disk inode and will not get updated until the next I_DIRTY_INODE
update, which might never come if we crash or get a power failure.

The problem can be reproduced on ext4 by running xfstest generic/622
with -o iversion mount option.

Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
I_DIRTY_INODE. Also make sure that the case is properly handled in
writeback_single_inode() as well. Additionally, changes were made in
xfs_fs_dirty_inode() to accommodate I_DIRTY_TIME in the flags.

Thanks Jan Kara for suggestions on how to make this work properly.

Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220825100657.44217-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29 23:02:00 -04:00
Lukas Czerner
50f094a558 ext4: don't increase iversion counter for ea_inodes
ea_inodes are using i_version for storing part of the reference count so
we really need to leave it alone.

The problem can be reproduced by xfstest ext4/026 when iversion is
enabled. Fix it by not calling inode_inc_iversion() for EXT4_EA_INODE_FL
inodes in ext4_mark_iloc_dirty().

Cc: stable@kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Link: https://lore.kernel.org/r/20220824160349.39664-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29 23:01:44 -04:00
Jan Kara
61a1d87a32 ext4: fix check for block being out of directory size
The check in __ext4_read_dirblock() for block being outside of directory
size was wrong because it compared block number against directory size
in bytes. Fix it.
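
The unit mismatch can be illustrated with a small user-space sketch. The function names and the use of a block-size shift are assumptions for illustration only; this is not the code in __ext4_read_dirblock().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Buggy form: compares a block index against a size in bytes. */
bool toy_block_in_dir_buggy(uint64_t block, uint64_t size_bytes)
{
	return block < size_bytes;
}

/* Fixed form: convert the size to blocks before comparing. */
bool toy_block_in_dir_fixed(uint64_t block, uint64_t size_bytes,
			    unsigned int blkbits)
{
	assert(blkbits < 64);
	return block < (size_bytes >> blkbits);
}
```

For a three-block directory with 4 KiB blocks (size 12288 bytes, shift 12), block 128 passes the buggy check because 128 is less than 12288, while the fixed check rejects it because the directory only has blocks 0 to 2.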

Fixes: 65f8ea4cd57d ("ext4: check if directory block is within i_size")
CVE: CVE-2022-1184
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220822114832.1482-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29 23:01:40 -04:00
Ritesh Harjani (IBM)
5bdf402a05 fs/buffer: make submit_bh & submit_bh_wbc return type as void
submit_bh/submit_bh_wbc are non-blocking functions which just submit
the bio and return. The caller of submit_bh/submit_bh_wbc needs to wait
on buffer till I/O completion and then check buffer head's b_state field
to know if there was any I/O error.

Hence there is no need for these functions to have any return type.
Even now they always return 0. Hence drop the return value and make
their return type void to avoid any confusion.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/cb66ef823374cdd94d2d03083ce13de844fffd41.1660788334.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29 23:01:40 -04:00
Ritesh Harjani (IBM)
ab620620bf fs/buffer: drop useless return value of submit_bh
submit_bh always returns 0. This patch drops the useless return value of
submit_bh from __sync_dirty_buffer(). Once all of submit_bh's callers are
cleaned up, we can make its return type void.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/a98a6ddfac68f73d684c2724952e825bc1f4d238.1660788334.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29 23:01:40 -04:00
Ritesh Harjani (IBM)
c547524852 fs/ntfs: drop useless return value of submit_bh from ntfs_submit_bh_for_read
submit_bh always returns 0. This patch drops the useless return value of
submit_bh from ntfs_submit_bh_for_read(). Once all of submit_bh's callers
are cleaned up, we can make its return type void.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/d82eb29e8dbc52fe13a7affef5c907ea4076aa31.1660788334.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29 23:01:40 -04:00
Ritesh Harjani (IBM)
f3ed5df398 jbd2: drop useless return value of submit_bh
submit_bh always returns 0. This patch cleans up two of its callers
in jbd2 to drop submit_bh's useless return value.
Once all submit_bh callers are cleaned up, we can make its return
type void.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/e069c0539be0aec61abcdc6f6141982ec85d489d.1660788334.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29 23:01:40 -04:00
Lalith Rajendran
3b575495ab ext4: make ext4_lazyinit_thread freezable
ext4_lazyinit_thread is not set freezable. Hence when the thread calls
try_to_freeze it doesn't freeze during suspend and continues to send
requests to the storage during suspend, resulting in suspend failures.

Cc: stable@kernel.org
Signed-off-by: Lalith Rajendran <lalithkraj@google.com>
Link: https://lore.kernel.org/r/20220818214049.1519544-1-lalithkraj@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29 23:01:27 -04:00
Linus Torvalds
987a926c1d fix for breakage in dump_user_range()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull coredump fix from Al Viro:
 "Fix for breakage in dump_user_range()"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  [coredump] don't use __kernel_write() on kmap_local_page()
2022-09-29 14:37:45 -07:00
Jakub Kicinski
accc3b4a57 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29 14:30:51 -07:00
Tetsuo Handa
cbddcc4fa3 btrfs: set generation before calling btrfs_clean_tree_block in btrfs_init_new_buffer
syzbot is reporting uninit-value in btrfs_clean_tree_block() [1], for
commit bc877d285ca3dba2 ("btrfs: Deduplicate extent_buffer init code")
missed that btrfs_set_header_generation() in btrfs_init_new_buffer() must
not be moved to after clean_tree_block() because clean_tree_block() is
calling btrfs_header_generation() since commit 55c69072d6bd5be1 ("Btrfs:
Fix extent_buffer usage when nodesize != leafsize").

Since memzero_extent_buffer() will reset "struct btrfs_header" part, we
can't move btrfs_set_header_generation() to before memzero_extent_buffer().
Just re-add btrfs_set_header_generation() before btrfs_clean_tree_block().

Link: https://syzkaller.appspot.com/bug?extid=fba8e2116a12609b6c59 [1]
Reported-by: syzbot <syzbot+fba8e2116a12609b6c59@syzkaller.appspotmail.com>
Fixes: bc877d285ca3dba2 ("btrfs: Deduplicate extent_buffer init code")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:31 +02:00
Filipe Manana
db21370bff btrfs: drop extent map range more efficiently
Currently when dropping extent maps for a file range, through
btrfs_drop_extent_map_range(), we do the following non-optimal things:

1) We look up extent maps one by one, always starting the search from
   the root of the extent map tree. This is not efficient if we have
   multiple extent maps in the range;

2) We check on every iteration if we have the 'split' and 'split2' spare
   extent maps in case we need to split an extent map that intersects our
   range but also crosses its boundaries (to the left, to the right or
   both cases). If our target range is for example:

       [2M, 8M)

   And we have 3 extent maps in the range:

       [1M, 3M) [3M, 6M) [6M, 10M)

   Then on the first iteration we allocate two extent maps for 'split' and
   'split2', and use the 'split' to split the first extent map, so after
   the split we set 'split' to 'split2' and then set 'split2' to NULL.

   On the second iteration, we don't need to split the second extent map,
   but because 'split2' is now NULL, we allocate a new extent map for
   'split2'.

   On the third iteration we need to split the third extent map, so we
   use the extent map pointed by 'split'.

   So we ended up allocating 3 extent maps for splitting, but all we
   needed was 2 extent maps. We never need to allocate more than 2,
   because extent maps that need to be split are always the first one
   and the last one in the target range.

Improve on this by:

1) Using rb_next() to move on to the next extent map. This results in
   iterating over fewer nodes of the tree and it does not require comparing
   the ranges of nodes to our start/end offset;

2) Allocate the 2 extent maps for splitting before entering the loop and
   never allocate more than 2. In practice it's very rare to have the
   combination of both extent map allocations fail, since we have a
   dedicated slab for extent maps, and also have the need to split two
   extent maps.
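
The claim that two spare extent maps are always enough can be checked with a small user-space sketch. 'toy_extent' and 'toy_count_splits()' are hypothetical names, and the extents are assumed to be sorted and non-overlapping, as in the example above.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MIB (1024ULL * 1024ULL)

/* Hypothetical half-open extent [start, end). */
struct toy_extent {
	uint64_t start;
	uint64_t end;
};

/*
 * Count how many spare extent maps are needed to drop [start, end):
 * one for each surviving piece to the left or right of the range.
 */
int toy_count_splits(const struct toy_extent *em, int n,
		     uint64_t start, uint64_t end)
{
	int splits = 0;

	assert(start < end);
	for (int i = 0; i < n; i++) {
		if (em[i].end <= start || em[i].start >= end)
			continue;	/* no overlap with the range */
		if (em[i].start < start)
			splits++;	/* left piece survives */
		if (em[i].end > end)
			splits++;	/* right piece survives */
	}
	return splits;
}
```

For the range [2M, 8M) and the extents [1M, 3M), [3M, 6M), [6M, 10M), only the first and the last extent cross a boundary, so the count is 2. Because the extents do not overlap, at most one of them can contain the start of the range and at most one can contain its end, so the count never exceeds 2.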

This patch is part of a patchset comprised of the following patches:

   btrfs: fix missed extent on fsync after dropping extent maps
   btrfs: move btrfs_drop_extent_cache() to extent_map.c
   btrfs: use extent_map_end() at btrfs_drop_extent_map_range()
   btrfs: use cond_resched_rwlock_write() during inode eviction
   btrfs: move open coded extent map tree deletion out of inode eviction
   btrfs: add helper to replace extent map range with a new extent map
   btrfs: remove the refcount warning/check at free_extent_map()
   btrfs: remove unnecessary extent map initializations
   btrfs: assert tree is locked when clearing extent map from logging
   btrfs: remove unnecessary NULL pointer checks when searching extent maps
   btrfs: remove unnecessary next extent map search
   btrfs: avoid pointless extent map tree search when flushing delalloc
   btrfs: drop extent map range more efficiently

And the following fio test was done before and after applying the whole
patchset, on a non-debug kernel (Debian's default kernel config) on a 12
cores Intel box with 64G of ram:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/nvme0n1
   MNT=/mnt/nvme0n1
   MOUNT_OPTIONS="-o ssd"
   MKFS_OPTIONS="-R free-space-tree -O no-holes"

   cat <<EOF > /tmp/fio-job.ini
   [writers]
   rw=randwrite
   fsync=8
   fallocate=none
   group_reporting=1
   direct=0
   bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
   ioengine=psync
   filesize=2G
   runtime=300
   time_based
   directory=$MNT
   numjobs=8
   thread
   EOF

   echo performance | \
       tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

   echo
   echo "Using config:"
   echo
   cat /tmp/fio-job.ini
   echo

   umount $MNT &> /dev/null
   mkfs.btrfs -f $MKFS_OPTIONS $DEV
   mount $MOUNT_OPTIONS $DEV $MNT

   fio /tmp/fio-job.ini

   umount $MNT

Result before applying the patchset:

   WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec

Result after applying the patchset:

   WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:31 +02:00
Filipe Manana
b54bb86556 btrfs: avoid pointless extent map tree search when flushing delalloc
When flushing delalloc, in COW mode at cow_file_range(), before entering
the loop that allocates extents and creates ordered extents, we do a call
to btrfs_drop_extent_map_range() for the whole range. This is pointless
because in the loop we call create_io_em(), which will also call
btrfs_drop_extent_map_range() before inserting the new extent map.

So remove that call at cow_file_range() not only because it is not needed,
but also because it will make the btrfs_drop_extent_map_range() calls made
from create_io_em() waste time searching the extent map tree, and that
tree can be large for files with many extents. It also makes us waste time
at btrfs_drop_extent_map_range() allocating and freeing the split extent
maps for nothing.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:31 +02:00
Filipe Manana
6c05813ebb btrfs: remove unnecessary next extent map search
At __tree_search(), and its single caller __lookup_extent_mapping(), there
is no point in finding the next extent map that starts after the search
offset if we were able to find the previous extent map that ends before
our search offset, because __lookup_extent_mapping() ignores the next
acceptable extent map if we were able to find the previous one.

So just return immediately if we were able to find the previous extent
map, therefore avoiding wasting time iterating the tree looking for the
next extent map which will not be used by __lookup_extent_mapping().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:31 +02:00
Filipe Manana
08f088dd63 btrfs: remove unnecessary NULL pointer checks when searching extent maps
The previous and next pointer arguments passed to __tree_search() are
never NULL as the only caller of this function, __lookup_extent_mapping(),
always passes the address of two on-stack pointers. So remove the NULL
checks and add assertions to verify the pointers.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:31 +02:00
Filipe Manana
74333c7d87 btrfs: assert tree is locked when clearing extent map from logging
When calling clear_em_logging() we should have a write lock on the extent
map tree, as we will try to merge the extent map with the previous and
next ones in the tree. So assert that we have a write lock.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:31 +02:00
Filipe Manana
2e0cdaa028 btrfs: remove unnecessary extent map initializations
When allocating an extent map, we use kmem_cache_zalloc() which guarantees
the returned memory is initialized to zeroes, therefore it's pointless
to initialize the generation and flags of the extent map to zero again.

Remove those initializations, as they are pointless and slightly increase
the object text size.

Before removing them:

   $ size fs/btrfs/extent_map.o
      text	   data	    bss	    dec	    hex	filename
      9241	    274	     24	   9539	   2543	fs/btrfs/extent_map.o

After removing them:

   $ size fs/btrfs/extent_map.o
      text	   data	    bss	    dec	    hex	filename
      9209	    274	     24	   9507	   2523	fs/btrfs/extent_map.o
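
A user-space analogy of the zero-initialization argument, using calloc() in place of kmem_cache_zalloc(). The 'toy_extent_map' structure and its fields are hypothetical and only loosely modelled on the fields named above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified extent map. */
struct toy_extent_map {
	uint64_t start;
	uint64_t len;
	uint64_t generation;
	unsigned long flags;
};

/* calloc() returns zeroed memory, so no field needs to be set to 0 again. */
struct toy_extent_map *toy_alloc_extent_map(void)
{
	return calloc(1, sizeof(struct toy_extent_map));
}

/* Check that every field of a freshly allocated map is already zero. */
bool toy_extent_map_is_zeroed(const struct toy_extent_map *em)
{
	assert(em != NULL);
	return em->start == 0 && em->len == 0 &&
	       em->generation == 0 && em->flags == 0;
}
```

Writing 'em->generation = 0' or 'em->flags = 0' right after such an allocation stores a value that is already there, which is the redundant work this patch removes.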

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:30 +02:00
Filipe Manana
ad5d6e9148 btrfs: remove the refcount warning/check at free_extent_map()
At free_extent_map(), it's pointless to have a WARN_ON() to check if the
refcount of the extent map is zero. Such check is already done by the
refcount_t module and refcount_dec_and_test(), which loudly complains if
we try to decrement a reference count that is currently 0.

The WARN_ON() dates back to the time when we used a regular atomic_t type
for the reference counter, before we switched to the refcount_t type.
The main goal of the refcount_t type/module is precisely to catch such
types of bugs and loudly complain if they happen.
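
As a rough illustration of why the extra WARN_ON() is redundant, here is a
minimal userspace sketch of a reference counter that complains by itself when
it is decremented at zero. The names demo_refcount and
demo_refcount_dec_and_test() are hypothetical stand-ins, not the kernel's
refcount_t API, and the real implementation differs in its details.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for a checked reference counter. */
struct demo_refcount {
	unsigned int refs;
	unsigned int warnings;	/* number of underflow complaints so far */
};

/*
 * Drop one reference. Returns true when the count reaches zero, meaning the
 * caller should free the object. Decrementing a counter that is already zero
 * is reported here, so callers do not need their own check.
 */
static bool demo_refcount_dec_and_test(struct demo_refcount *r)
{
	if (r->refs == 0) {
		r->warnings++;
		fprintf(stderr, "demo_refcount: decrement at 0, possible use-after-free\n");
		return false;
	}
	r->refs--;
	return r->refs == 0;
}
```

With the check living inside the decrement helper, a caller such as a free
routine only has to act on the return value.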

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:30 +02:00
Filipe Manana
a1ba4c080b btrfs: add helper to replace extent map range with a new extent map
We have several places that need to drop all the extent maps in a given
file range and then add a new extent map for that range. Currently they
call btrfs_drop_extent_map_range() to delete all extent maps in the range
and then keep trying to add the new extent map in a loop that keeps
retrying while the insertion of the new extent map fails with -EEXIST.

So instead of repeating this logic, add a helper to extent_map.c that
does these steps and name it btrfs_replace_extent_map_range(). Also add
a comment about why the retry loop is necessary.
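
The drop-then-add-with-retry pattern described above can be sketched in
userspace C as follows. This is only an illustration under simplifying
assumptions: the tree is a small array, there is no locking and no splitting,
and every demo_* name is hypothetical. It does not try to reproduce the
reasoning of the comment mentioned above about why the retry is needed.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_MAPS 16

/* Hypothetical, simplified extent map covering [start, start + len). */
struct demo_em {
	uint64_t start;
	uint64_t len;
};

/* Hypothetical stand-in for the extent map tree: a small unsorted array. */
struct demo_tree {
	struct demo_em maps[DEMO_MAX_MAPS];
	size_t count;
};

static int demo_overlaps(const struct demo_em *em, uint64_t start, uint64_t end)
{
	return em->start < end && start < em->start + em->len;
}

/* Insert a map; fail with -EEXIST if it overlaps an existing one. */
static int demo_add_mapping(struct demo_tree *tree, const struct demo_em *em)
{
	for (size_t i = 0; i < tree->count; i++)
		if (demo_overlaps(&tree->maps[i], em->start, em->start + em->len))
			return -EEXIST;
	if (tree->count == DEMO_MAX_MAPS)
		return -ENOMEM;
	tree->maps[tree->count++] = *em;
	return 0;
}

/* Remove every map that overlaps [start, end). No splitting in this sketch. */
static void demo_drop_range(struct demo_tree *tree, uint64_t start, uint64_t end)
{
	size_t kept = 0;

	for (size_t i = 0; i < tree->count; i++)
		if (!demo_overlaps(&tree->maps[i], start, end))
			tree->maps[kept++] = tree->maps[i];
	tree->count = kept;
}

/* Drop everything in the new map's range, then add it, retrying on -EEXIST. */
static int demo_replace_range(struct demo_tree *tree, const struct demo_em *new_em)
{
	int ret;

	do {
		demo_drop_range(tree, new_em->start, new_em->start + new_em->len);
		ret = demo_add_mapping(tree, new_em);
	} while (ret == -EEXIST);

	return ret;
}
```

In this single-threaded sketch the loop body runs once; the retry only matters
when something else can insert a map between the drop and the add.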

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:30 +02:00
Filipe Manana
9c9d1b4f74 btrfs: move open coded extent map tree deletion out of inode eviction
Move the loop that removes all the extent maps from the inode's extent
map tree during inode eviction out of inode.c and into extent_map.c, to
btrfs_drop_extent_map_range(). Anything manipulating extent maps or the
extent map tree should be in extent_map.c.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:30 +02:00
Filipe Manana
99ba0c8150 btrfs: use cond_resched_rwlock_write() during inode eviction
At evict_inode_truncate_pages(), instead of manually checking if
rescheduling is needed, unlocking the extent map tree, rescheduling and
then write locking the tree again, use the helper
cond_resched_rwlock_write(), which does all that.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:30 +02:00
Filipe Manana
f3109e33bb btrfs: use extent_map_end() at btrfs_drop_extent_map_range()
Instead of open coding the end offset calculation of an extent map, use
the helper extent_map_end() and cache its result in a local variable,
since it's used several times.
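
A small userspace sketch of the idea: compute the end offset once through a
helper and reuse the cached value. The demo_* names are hypothetical, and the
clamping on overflow is an assumption of this sketch rather than a statement
about the kernel helper.

```c
#include <stdint.h>

/* Hypothetical, simplified extent map covering [start, start + len). */
struct demo_em {
	uint64_t start;
	uint64_t len;
};

/* Exclusive end offset; clamps to UINT64_MAX if start + len would wrap. */
static uint64_t demo_em_end(const struct demo_em *em)
{
	if (em->start + em->len < em->start)
		return UINT64_MAX;
	return em->start + em->len;
}

/* Number of bytes of the map that fall inside [start, end). */
static uint64_t demo_overlap_len(const struct demo_em *em, uint64_t start,
				 uint64_t end)
{
	/* Computed once and reused below instead of open coding start + len. */
	const uint64_t em_end = demo_em_end(em);
	const uint64_t lo = em->start > start ? em->start : start;
	const uint64_t hi = em_end < end ? em_end : end;

	return hi > lo ? hi - lo : 0;
}
```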

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:30 +02:00
Filipe Manana
4c0c8cfc84 btrfs: move btrfs_drop_extent_cache() to extent_map.c
The function btrfs_drop_extent_cache() doesn't really belong at file.c
because what it does is drop a range of extent maps for a file range.
It directly allocates and manipulates extent maps, by dropping,
splitting and replacing them in an extent map tree, so it should be
located at extent_map.c, where all manipulations of an extent map tree
and its extent maps are supposed to be done.

So move it out of file.c and into extent_map.c. Additionally do the
following changes:

1) Rename it to btrfs_drop_extent_map_range(), which makes it clearer
   what it does. The term "cache" is a bit confusing as it's not widely
   used; "extent maps" or "extent mapping" is much more common;

2) Change its 'skip_pinned' argument from int to bool;

3) Turn several of its local variables from int to bool, since they are
   used as booleans;

4) Move the declaration of some variables out of the function's main
   scope and into the scopes where they are used;

5) Remove pointless assignment of false to 'modified' early in the while
   loop, as later that variable is set and it's not used before that
   second assignment;

6) Remove checks for NULL before calling free_extent_map().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:30 +02:00
Filipe Manana
cef7820d6a btrfs: fix missed extent on fsync after dropping extent maps
When dropping extent maps for a range, through btrfs_drop_extent_cache(),
if we find an extent map that starts before our target range and/or ends
after the target range, and we are not able to allocate extent maps for
splitting that extent map, then we don't fail and simply remove the entire
extent map from the inode's extent map tree.

This is generally fine, because in case anyone needs to access the extent
map, it can just load it again later from the respective file extent
item(s) in the subvolume btree. However, if that extent map is new and is
in the list of modified extents, then a fast fsync will miss the parts of
the extent that were outside our range (that needed to be split),
therefore not logging them. Fix that by marking the inode for a full
fsync. This issue was introduced after removing BUG_ON()s triggered when
the split extent map allocations failed, done by commit 7014cdb49305ed
("Btrfs: btrfs_drop_extent_cache should never fail"), back in 2012, and
the fast fsync path already existed but was very recent.

Also, in the case where we could allocate extent maps for the split
operations but then fail to add a split extent map to the tree, mark the
inode for a full fsync as well. This is not supposed to ever fail, and we
assert that, but in case assertions are disabled (CONFIG_BTRFS_ASSERT is
not set), it's the correct thing to do to make sure a fast fsync will not
miss a new extent.
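
The fallback described above can be illustrated with a simplified userspace
sketch: when an extent map sticks out of the dropped range and a split piece
cannot be allocated, the inode is flagged for a full fsync if the map was a
modified one. All demo_* names and the fail_alloc parameter are hypothetical
and exist only to simulate the allocation failure; this is not the kernel
code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified extent map covering [start, start + len). */
struct demo_em {
	uint64_t start;
	uint64_t len;
	bool modified;	/* true if a fast fsync would need to log it */
};

struct demo_inode {
	bool need_full_sync;
};

/* Pieces of the original map that survive outside the dropped range. */
struct demo_split {
	struct demo_em *front;
	struct demo_em *tail;
};

static struct demo_em *demo_alloc_piece(const struct demo_em *em,
					uint64_t start, uint64_t end, bool fail)
{
	struct demo_em *piece = fail ? NULL : calloc(1, sizeof(*piece));

	if (piece) {
		piece->start = start;
		piece->len = end - start;
		piece->modified = em->modified;
	}
	return piece;
}

/*
 * Drop [start, end) from a map that overlaps it. The caller owns and frees
 * the returned pieces. If a needed piece cannot be allocated, the whole map
 * is lost, so a modified map forces a full fsync.
 */
static struct demo_split demo_drop_from_em(struct demo_inode *inode,
					   const struct demo_em *em,
					   uint64_t start, uint64_t end,
					   bool fail_alloc)
{
	struct demo_split out = { NULL, NULL };
	const uint64_t em_end = em->start + em->len;
	const bool need_front = em->start < start;
	const bool need_tail = em_end > end;

	if (need_front)
		out.front = demo_alloc_piece(em, em->start, start, fail_alloc);
	if (need_tail)
		out.tail = demo_alloc_piece(em, end, em_end, fail_alloc);

	if (em->modified &&
	    ((need_front && !out.front) || (need_tail && !out.tail)))
		inode->need_full_sync = true;

	return out;
}
```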

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:30 +02:00
Jeff Layton
3050dfa63e btrfs: remove stale prototype of btrfs_write_inode
This function no longer exists; it was removed in 3c4276936f6f ("Btrfs: fix
btrfs_write_inode vs delayed iput deadlock").

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:29 +02:00
Stefan Roesch
926078b21d btrfs: enable nowait async buffered writes
Enable nowait async buffered writes in btrfs_do_write_iter() and
btrfs_file_open().

In this version the optimization is not enabled for encoded buffered
writes. Encoded writes are done through an ioctl, and io_uring currently
does not support ioctls. This might be enabled in the future.

Performance results:

  For fio the following results have been obtained with a queue depth of
  1 and 4k block size (runtime 600 secs):

                 sequential writes:
                 without patch           with patch      libaio     psync
  iops:              55k                    134k          117K       148K
  bw:               221MB/s                 538MB/s       469MB/s    592MB/s
  clat:           15286ns                    82ns         994ns     6340ns

For an io depth of 1, the new patch more than doubles throughput
(compared to the existing behavior, where buffered writes are
processed by an io-worker process) and also considerably reduces
latency. To achieve the same or better performance with the existing
code an io depth of 4 is required. Increasing the io depth further does
not lead to improvements.

The tests have been run like this:

./fio --name=seq-writers --ioengine=psync --iodepth=1 --rw=write \
  --bs=4k --direct=0 --size=100000m --time_based --runtime=600   \
  --numjobs=1 --filename=...
./fio --name=seq-writers --ioengine=io_uring --iodepth=1 --rw=write \
  --bs=4k --direct=0 --size=100000m --time_based --runtime=600   \
  --numjobs=1 --filename=...
./fio --name=seq-writers --ioengine=libaio --iodepth=1 --rw=write \
  --bs=4k --direct=0 --size=100000m --time_based --runtime=600   \
  --numjobs=1 --filename=...

Testing:
  This patch has been tested with xfstests, fsx, fio. xfstests shows no new
  diffs compared to running without the patch series.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:29 +02:00
Stefan Roesch
c922b016f3 btrfs: assert nowait mode is not used for some btree search functions
Add nowait asserts to the btree search functions that are not used by
the buffered IO and direct IO paths.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:29 +02:00
Stefan Roesch
965f47aeb5 btrfs: make btrfs_buffered_write nowait compatible
We need to avoid unconditionally calling balance_dirty_pages_ratelimited()
as it may block. When the buffered write is nowait, use
balance_dirty_pages_ratelimited_flags() with the BDP_ASYNC flag instead,
and return -EAGAIN if it would have to wait.

This also moves the call to after the 'again' label. The function may
then be called slightly later, but this should have no impact in the
real world.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:28 +02:00
Stefan Roesch
304e45acdb btrfs: plumb NOWAIT through the write path
Everything is now set up for nowait, so plumb NOWAIT through the write path.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:28 +02:00
Stefan Roesch
2fcab928cc btrfs: make lock_and_cleanup_extent_if_need nowait compatible
Add the nowait parameter to lock_and_cleanup_extent_if_need(). If the
nowait parameter is specified we try to lock the extent in nowait mode.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:28 +02:00
Stefan Roesch
fc22600012 btrfs: make prepare_pages nowait compatible
Add a nowait parameter to the prepare_pages() function. In case nowait is
specified for an async buffered write request, do a nowait allocation or
return -EAGAIN.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:28 +02:00
Josef Bacik
80f9d24130 btrfs: make btrfs_check_nocow_lock nowait compatible
Now that all the helpers that btrfs_check_nocow_lock uses handle nowait,
add a nowait flag to btrfs_check_nocow_lock so it can be used by the write
path.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:28 +02:00
Josef Bacik
d2c7a19f5c btrfs: add btrfs_try_lock_ordered_range
For IOCB_NOWAIT we're going to want to use try lock on the extent lock,
and simply bail if there's an ordered extent in the range because the
only choice there is to wait for the ordered extent to complete.
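
The try-lock-and-bail behavior described above can be sketched in userspace C
with a C11 atomic flag standing in for the extent lock. Every demo_* name is
hypothetical, the ordered extent is reduced to a boolean, and the -EAGAIN
return from the caller is an assumption of this sketch modeled on how nowait
paths usually report that they would have to block.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a lockable file range. */
struct demo_range {
	atomic_flag locked;		/* stands in for the extent lock */
	bool has_ordered_extent;	/* pending ordered extent in the range */
};

static void demo_unlock_range(struct demo_range *r)
{
	atomic_flag_clear(&r->locked);
}

/*
 * Try to take the range lock without waiting. Returns true with the lock
 * held. Returns false, with the lock not held, if the lock is busy or if an
 * ordered extent exists, since the only other option would be to wait.
 */
static bool demo_try_lock_ordered_range(struct demo_range *r)
{
	if (atomic_flag_test_and_set(&r->locked))
		return false;	/* someone else holds the lock */

	if (r->has_ordered_extent) {
		demo_unlock_range(r);
		return false;	/* would have to wait for it to complete */
	}
	return true;
}

/* A nowait caller turns a failed try lock into -EAGAIN instead of blocking. */
static int demo_nowait_write(struct demo_range *r)
{
	if (!demo_try_lock_ordered_range(r))
		return -EAGAIN;

	/* ... the actual write would happen here ... */

	demo_unlock_range(r);
	return 0;
}
```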

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:28 +02:00
Josef Bacik
1daedb1d6b btrfs: add the ability to use NO_FLUSH for data reservations
In order to accommodate NOWAIT IOCBs we need to be able to do NO_FLUSH
data reservations, so plumb this through the delalloc reservation
system.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:28 +02:00