Commit Graph

4992 Commits

Author SHA1 Message Date
Maximilian Heyne
a6b3bfe176 ext4: fix corruption during on-line resize
We observed a corruption during on-line resize of a file system that is
larger than 16 TiB with 4k block size. With having more then 2^32 blocks
resize_inode is turned off by default by mke2fs. The issue can be
reproduced on a smaller file system for convenience by explicitly
turning off resize_inode. An on-line resize across an 8 GiB boundary (the
size of a meta block group in this setup) then leads to a corruption:

  dev=/dev/<some_dev> # should be >= 16 GiB
  mkdir -p /corruption
  /sbin/mke2fs -t ext4 -b 4096 -O ^resize_inode $dev $((2 * 2**21 - 2**15))
  mount -t ext4 $dev /corruption

  dd if=/dev/zero bs=4096 of=/corruption/test count=$((2*2**21 - 4*2**15))
  sha1sum /corruption/test
  # 79d2658b39dcfd77274e435b0934028adafaab11  /corruption/test

  /sbin/resize2fs $dev $((2*2**21))
  # drop page cache to force reload the block from disk
  echo 1 > /proc/sys/vm/drop_caches

  sha1sum /corruption/test
  # 3c2abc63cbf1a94c9e6977e0fbd72cd832c4d5c3  /corruption/test

2^21 = 2^15*2^6 equals 8 GiB whereof 2^15 is the number of blocks per
block group and 2^6 are the number of block groups that make a meta
block group.

The last checksum might be different depending on how the file is laid
out across the physical blocks. The actual corruption occurs at physical
block 63*2^15 = 2064384 which would be the location of the backup of the
meta block group's block descriptor. During the on-line resize the file
system will be converted to meta_bg starting at s_first_meta_bg which is
2 in the example - meaning all block groups after 16 GiB. However, in
ext4_flex_group_add we might add block groups that are not part of the
first meta block group yet. In the reproducer we achieved this by
substracting the size of a whole block group from the point where the
meta block group would start. This must be considered when updating the
backup block group descriptors to follow the non-meta_bg layout. The fix
is to add a test whether the group to add is already part of the meta
block group or not.

Fixes: 01f795f9e0 ("ext4: add online resizing support for meta_bg and 64-bit file systems")
Cc:  <stable@vger.kernel.org>
Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
Tested-by: Srivathsa Dara <srivathsa.d.dara@oracle.com>
Reviewed-by: Srivathsa Dara <srivathsa.d.dara@oracle.com>
Link: https://lore.kernel.org/r/20240215155009.94493-1-mheyne@amazon.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-03-07 13:32:54 -05:00
Jan Kara
fa60629380 ext4: don't report EOPNOTSUPP errors from discard
When ext4 is mounted without journal, with discard mount option, and on
a device not supporting trim, we print error for each and every freed
extent. This is not only useless but actively harmful. Instead ignore
the EOPNOTSUPP error. Trim is only advisory anyway and when the
filesystem has journal we silently ignore trim error as well.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20240213101601.17463-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-03-07 13:32:54 -05:00
Jan Kara
7f48212678 ext4: drop duplicate ea_inode handling in ext4_xattr_block_set()
ext4_xattr_block_set() drops ea_inode reference in two places. Handling
it just under the 'cleanup' label is enough so drop the second
occurence.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240209112107.10585-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-03-07 13:32:54 -05:00
Jan Kara
8208c41c43 ext4: fold quota accounting into ext4_xattr_inode_lookup_create()
When allocating EA inode, quota accounting is done just before
ext4_xattr_inode_lookup_create(). Logically these two operations belong
together so just fold quota accounting into
ext4_xattr_inode_lookup_create(). We also make
ext4_xattr_inode_lookup_create() return the looked up / created inode to
convert the function to a more standard calling convention.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240209112107.10585-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-21 23:44:50 -05:00
Baokun Li
4fbf8bc733 ext4: correct best extent lstart adjustment logic
When yangerkun review commit 93cdf49f6e ("ext4: Fix best extent lstart
adjustment logic in ext4_mb_new_inode_pa()"), it was found that the best
extent did not completely cover the original request after adjusting the
best extent lstart in ext4_mb_new_inode_pa() as follows:

  original request: 2/10(8)
  normalized request: 0/64(64)
  best extent: 0/9(9)

When we check if best ex can be kept at start of goal, ac_o_ex.fe_logical
is 2 less than the adjusted best extent logical end 9, so we think the
adjustment is done. But obviously 0/9(9) doesn't cover 2/10(8), so we
should determine here if the original request logical end is less than or
equal to the adjusted best extent logical end.

In addition, add a comment stating when adjusted best_ex will not cover
the original request, and remove the duplicate assertion because adjusting
lstart makes no change to b_ex.fe_len.

Link: https://lore.kernel.org/r/3630fa7f-b432-7afd-5f79-781bc3b2c5ea@huawei.com
Fixes: 93cdf49f6e ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()")
Cc:  <stable@kernel.org>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20240201141845.1879253-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-21 22:46:39 -05:00
Ye Bin
d8b945fa47 ext4: forbid commit inconsistent quota data when errors=remount-ro
There's issue as follows When do IO fault injection test:
Quota error (device dm-3): find_block_dqentry: Quota for id 101 referenced but not present
Quota error (device dm-3): qtree_read_dquot: Can't read quota structure for id 101
Quota error (device dm-3): do_check_range: Getting block 2021161007 out of range 1-186
Quota error (device dm-3): qtree_read_dquot: Can't read quota structure for id 661

Now, ext4_write_dquot()/ext4_acquire_dquot()/ext4_release_dquot() may commit
inconsistent quota data even if process failed. This may lead to filesystem
corruption.
To ensure filesystem consistent when errors=remount-ro there is need to call
ext4_handle_error() to abort journal.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240119062908.3598806-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-21 22:40:57 -05:00
Zhang Yi
68ee261fb1 ext4: add a hint for block bitmap corrupt state in mb_groups
If one group is marked as block bitmap corrupted, its free blocks cannot
be used and its free count is also deducted from the global
sbi->s_freeclusters_counter. User might be confused about the absent
free space because we can't query the information about corrupted block
groups except unreliable error messages in syslog. So add a hint to show
block bitmap corrupted groups in mb_groups.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240119061154.1525781-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-21 22:39:38 -05:00
Cheng Nie
547e64bda9 ext4: fix the comment of ext4_map_blocks()/ext4_ext_map_blocks()
this comment of ext4_map_blocks()/ext4_ext_map_blocks() need
update after commit c21770573319("ext4: Define a new set of
flags for ext4_get_blocks()").

Signed-off-by: Cheng Nie <niecheng1@uniontech.com>
Link: https://lore.kernel.org/r/20240118062511.28276-1-niecheng1@uniontech.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-21 22:33:21 -05:00
yangerkun
4b55d3431c ext4: improve error msg for ext4_mb_seq_groups_show
While cat mb_groups for a mounted ext4 image which has some corrupted
group, the string return to userspace was just "I/O error" which confuse
me a lot. Improve it with ext4_decode_error.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240118042557.380058-2-yangerkun@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-21 22:33:21 -05:00
yangerkun
250448802c ext4: remove unused buddy_loaded in ext4_mb_seq_groups_show
We can just first call ext4_mb_unload_buddy, then copy information from
ext4_group_info. So remove this unused value.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240118042557.380058-1-yangerkun@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-21 22:33:21 -05:00
Kemeng Shi
2b81493f8e ext4: Add unit test for ext4_mb_mark_diskspace_used
Add unit test for ext4_mb_mark_diskspace_used

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240103104900.464789-6-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-21 22:33:20 -05:00
Kemeng Shi
b7098e1fa7 ext4: Add unit test for mb_free_blocks
Add unit test for mb_free_blocks.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240103104900.464789-5-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-21 22:33:20 -05:00
Kemeng Shi
ac96b56a2f ext4: Add unit test for mb_mark_used
Add unit test for mb_mark_used

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240103104900.464789-4-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-21 22:33:20 -05:00
Kemeng Shi
67d2a11b22 ext4: Add unit test of ext4_mb_generate_buddy
Add unit test of ext4_mb_generate_buddy

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240103104900.464789-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-21 22:33:20 -05:00
Kemeng Shi
6c5e0c9c21 ext4: Add unit test for test_free_blocks_simple
Add unit test for test_free_blocks_simple.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240103104900.464789-2-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-21 22:33:20 -05:00
Linus Torvalds
3f24fcdacd Miscellaneous bug fixes and cleanups in ext4's multi-block allocator
and extent handling code.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmW/G4YACgkQ8vlZVpUN
 gaPTpwf/c/Fk27GV8ge9PQtR6gmir/lyw2qkvK3Z+12aEsblZRmyvElyZWjAuNQG
 bciQyltabIPOA4XxfsZOrdgYI42n0rTTFG7bmI0lr+BJM/HRw0tGGMN91FZla0FP
 EXv/AiHKCqlT5OFZbD+8n1TzfdOgWotjug1VLteXve3YKjkDgt5IQm/0Gx9hKBld
 IR8SrQlD/rYe+VPvaHz5G4u09Ne5pUE5fDj3xE23wxfU5cloVzuVRCSOGWUCTnCW
 T0v6sHeKrmiLC8tIOZkBjer4nXC0MOu0p5+geAjwOArc9VJ1Lh2eAkH+GgDOVprx
 ahdl2qmbIbacBYECIeQ/+1i78+O1yw==
 =CmYr
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Miscellaneous bug fixes and cleanups in ext4's multi-block allocator
  and extent handling code"

* tag 'for-linus-6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
  ext4: make ext4_set_iomap() recognize IOMAP_DELALLOC map type
  ext4: make ext4_map_blocks() distinguish delalloc only extent
  ext4: add a hole extent entry in cache after punch
  ext4: correct the hole length returned by ext4_map_blocks()
  ext4: convert to exclusive lock while inserting delalloc extents
  ext4: refactor ext4_da_map_blocks()
  ext4: remove 'needed' in trace_ext4_discard_preallocations
  ext4: remove unnecessary parameter "needed" in ext4_discard_preallocations
  ext4: remove unused return value of ext4_mb_release_group_pa
  ext4: remove unused return value of ext4_mb_release_inode_pa
  ext4: remove unused return value of ext4_mb_release
  ext4: remove unused ext4_allocation_context::ac_groups_considered
  ext4: remove unneeded return value of ext4_mb_release_context
  ext4: remove unused parameter ngroup in ext4_mb_choose_next_group_*()
  ext4: remove unused return value of __mb_check_buddy
  ext4: mark the group block bitmap as corrupted before reporting an error
  ext4: avoid allocating blocks from corrupted group in ext4_mb_find_by_goal()
  ext4: avoid allocating blocks from corrupted group in ext4_mb_try_best_found()
  ext4: avoid dividing by 0 in mb_update_avg_fragment_size() when block bitmap corrupt
  ext4: avoid bb_free and bb_fragments inconsistency in mb_free_blocks()
  ...
2024-02-04 07:33:01 +00:00
Zhang Yi
ec9d669eba ext4: make ext4_set_iomap() recognize IOMAP_DELALLOC map type
Since ext4_map_blocks() can recognize a delayed allocated only extent,
make ext4_set_iomap() can also recognize it, and remove the useless
separate check in ext4_iomap_begin_report().

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240127015825.1608160-7-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-01 23:59:21 -05:00
Zhang Yi
874eaba96d ext4: make ext4_map_blocks() distinguish delalloc only extent
Add a new map flag EXT4_MAP_DELAYED to indicate the mapping range is a
delayed allocated only (not unwritten) one, and making
ext4_map_blocks() can distinguish it, no longer mixing it with holes.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240127015825.1608160-6-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-01 23:59:21 -05:00
Zhang Yi
9f1118223a ext4: add a hole extent entry in cache after punch
In order to cache hole extents in the extent status tree and keep the
hole length as long as possible, re-add a hole entry to the cache just
after punching a hole.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240127015825.1608160-5-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-01 23:59:21 -05:00
Zhang Yi
6430dea07e ext4: correct the hole length returned by ext4_map_blocks()
In ext4_map_blocks(), if we can't find a range of mapping in the
extents cache, we are calling ext4_ext_map_blocks() to search the real
path and ext4_ext_determine_hole() to determine the hole range. But if
the querying range was partially or completely overlaped by a delalloc
extent, we can't find it in the real extent path, so the returned hole
length could be incorrect.

Fortunately, ext4_ext_put_gap_in_cache() have already handle delalloc
extent, but it searches start from the expanded hole_start, doesn't
start from the querying range, so the delalloc extent found could not be
the one that overlaped the querying range, plus, it also didn't adjust
the hole length. Let's just remove ext4_ext_put_gap_in_cache(), handle
delalloc and insert adjusted hole extent in ext4_ext_determine_hole().

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240127015825.1608160-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-01 23:47:02 -05:00
Zhang Yi
acf795dc16 ext4: convert to exclusive lock while inserting delalloc extents
ext4_da_map_blocks() only hold i_data_sem in shared mode and i_rwsem
when inserting delalloc extents, it could be raced by another querying
path of ext4_map_blocks() without i_rwsem, .e.g buffered read path.
Suppose we buffered read a file containing just a hole, and without any
cached extents tree, then it is raced by another delayed buffered write
to the same area or the near area belongs to the same hole, and the new
delalloc extent could be overwritten to a hole extent.

 pread()                           pwrite()
  filemap_read_folio()
   ext4_mpage_readpages()
    ext4_map_blocks()
     down_read(i_data_sem)
     ext4_ext_determine_hole()
     //find hole
     ext4_ext_put_gap_in_cache()
      ext4_es_find_extent_range()
      //no delalloc extent
                                    ext4_da_map_blocks()
                                     down_read(i_data_sem)
                                     ext4_insert_delayed_block()
                                     //insert delalloc extent
      ext4_es_insert_extent()
      //overwrite delalloc extent to hole

This race could lead to inconsistent delalloc extents tree and
incorrect reserved space counter. Fix this by converting to hold
i_data_sem in exclusive mode when adding a new delalloc extent in
ext4_da_map_blocks().

Cc: stable@vger.kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240127015825.1608160-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-01 23:47:02 -05:00
Zhang Yi
3fcc2b887a ext4: refactor ext4_da_map_blocks()
Refactor and cleanup ext4_da_map_blocks(), reduce some unnecessary
parameters and branches, no logic changes.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240127015825.1608160-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-01 23:47:02 -05:00
Kemeng Shi
f0e54b6087 ext4: remove 'needed' in trace_ext4_discard_preallocations
As 'needed' to trace_ext4_discard_preallocations is always 0 which
is meaningless. Just remove it.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-10-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18 10:52:45 -05:00
Kemeng Shi
2ffd2a6ad1 ext4: remove unnecessary parameter "needed" in ext4_discard_preallocations
The "needed" controls the number of ext4_prealloc_space to discard in
ext4_discard_preallocations. Function ext4_discard_preallocations is
supposed to discard all non-used preallocated blocks when "needed"
is 0 and now ext4_discard_preallocations is always called with "needed"
= 0. Remove unnecessary parameter "needed" and remove all non-used
preallocated spaces in ext4_discard_preallocations to simplify the
code.

Note: If count of non-used preallocated spaces could be more than
UINT_MAX, there was a memory leak as some non-used preallocated
spaces are left ununsed and this commit will fix it. Otherwise,
there is no behavior change.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-9-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18 10:52:45 -05:00
Kemeng Shi
20427949b9 ext4: remove unused return value of ext4_mb_release_group_pa
Remove unused return value of ext4_mb_release_group_pa.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-8-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18 10:52:45 -05:00
Kemeng Shi
820c280896 ext4: remove unused return value of ext4_mb_release_inode_pa
Remove unused return value of ext4_mb_release_inode_pa

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-7-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18 10:52:45 -05:00
Kemeng Shi
908177175a ext4: remove unused return value of ext4_mb_release
Remove unused return value of ext4_mb_release.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-6-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18 10:52:45 -05:00
Kemeng Shi
97c32dbffc ext4: remove unused ext4_allocation_context::ac_groups_considered
Remove unused ext4_allocation_context::ac_groups_considered

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-5-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18 10:52:45 -05:00
Kemeng Shi
11fd1a5d64 ext4: remove unneeded return value of ext4_mb_release_context
Function ext4_mb_release_context always return 0 and the return value is
never used. Just remove unneeded return value of ext4_mb_release_context.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-4-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18 10:52:44 -05:00
Kemeng Shi
438a35e72d ext4: remove unused parameter ngroup in ext4_mb_choose_next_group_*()
Remove unused parameter ngroup in ext4_mb_choose_next_group_*().

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20240105092102.496631-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18 10:52:44 -05:00
Kemeng Shi
133de5a0d8 ext4: remove unused return value of __mb_check_buddy
Remove unused return value of __mb_check_buddy.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-2-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18 10:52:44 -05:00
Baokun Li
c5f3a3821d ext4: mark the group block bitmap as corrupted before reporting an error
Otherwise unlocking the group in ext4_grp_locked_error may allow other
processes to modify the core block bitmap that is known to be corrupt.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-9-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18 10:50:24 -05:00
Baokun Li
832698373a ext4: avoid allocating blocks from corrupted group in ext4_mb_find_by_goal()
Places the logic for checking if the group's block bitmap is corrupt under
the protection of the group lock to avoid allocating blocks from the group
with a corrupted block bitmap.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-8-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18 10:50:24 -05:00
Baokun Li
4530b3660d ext4: avoid allocating blocks from corrupted group in ext4_mb_try_best_found()
Determine if the group block bitmap is corrupted before using ac_b_ex in
ext4_mb_try_best_found() to avoid allocating blocks from a group with a
corrupted block bitmap in the following concurrency and making the
situation worse.

ext4_mb_regular_allocator
  ext4_lock_group(sb, group)
  ext4_mb_good_group
   // check if the group bbitmap is corrupted
  ext4_mb_complex_scan_group
   // Scan group gets ac_b_ex but doesn't use it
  ext4_unlock_group(sb, group)
                           ext4_mark_group_bitmap_corrupted(group)
                           // The block bitmap was corrupted during
                           // the group unlock gap.
  ext4_mb_try_best_found
    ext4_lock_group(ac->ac_sb, group)
    ext4_mb_use_best_found
      mb_mark_used
      // Allocating blocks in block bitmap corrupted group

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-7-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18 10:50:24 -05:00
Baokun Li
993bf0f4c3 ext4: avoid dividing by 0 in mb_update_avg_fragment_size() when block bitmap corrupt
Determine if bb_fragments is 0 instead of determining bb_free to eliminate
the risk of dividing by zero when the block bitmap is corrupted.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-6-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18 10:50:24 -05:00
Baokun Li
2331fd4a49 ext4: avoid bb_free and bb_fragments inconsistency in mb_free_blocks()
After updating bb_free in mb_free_blocks, it is possible to return without
updating bb_fragments because the block being freed is found to have
already been freed, which leads to inconsistency between bb_free and
bb_fragments.

Since the group may be unlocked in ext4_grp_locked_error(), this can lead
to problems such as dividing by zero when calculating the average fragment
length. Hence move the update of bb_free to after the block double-free
check guarantees that the corresponding statistics are updated only after
the core block bitmap is modified.

Fixes: eabe0444df ("ext4: speed-up releasing blocks on commit")
CC:  <stable@vger.kernel.org> # 3.10
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18 10:50:24 -05:00
Baokun Li
c9b528c357 ext4: regenerate buddy after block freeing failed if under fc replay
This mostly reverts commit 6bd97bf273 ("ext4: remove redundant
mb_regenerate_buddy()") and reintroduces mb_regenerate_buddy(). Based on
code in mb_free_blocks(), fast commit replay can end up marking as free
blocks that are already marked as such. This causes corruption of the
buddy bitmap so we need to regenerate it in that case.

Reported-by: Jan Kara <jack@suse.cz>
Fixes: 6bd97bf273 ("ext4: remove redundant mb_regenerate_buddy()")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18 10:50:24 -05:00
Baokun Li
172202152a ext4: do not trim the group with corrupted block bitmap
Otherwise operating on an incorrupted block bitmap can lead to all sorts
of unknown problems.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18 10:50:24 -05:00
Baokun Li
55583e899a ext4: fix double-free of blocks due to wrong extents moved_len
In ext4_move_extents(), moved_len is only updated when all moves are
successfully executed, and only discards orig_inode and donor_inode
preallocations when moved_len is not zero. When the loop fails to exit
after successfully moving some extents, moved_len is not updated and
remains at 0, so it does not discard the preallocations.

If the moved extents overlap with the preallocated extents, the
overlapped extents are freed twice in ext4_mb_release_inode_pa() and
ext4_process_freed_data() (as described in commit 94d7c16cbb ("ext4:
Fix double-free of blocks with EXT4_IOC_MOVE_EXT")), and bb_free is
incremented twice. Hence when trim is executed, a zero-division bug is
triggered in mb_update_avg_fragment_size() because bb_free is not zero
and bb_fragments is zero.

Therefore, update move_len after each extent move to avoid the issue.

Reported-by: Wei Chen <harperchen1110@gmail.com>
Reported-by: xingwei lee <xrivendell7@gmail.com>
Closes: https://lore.kernel.org/r/CAO4mrferzqBUnCag8R3m2zf897ts9UEuhjFQGPtODT92rYyR2Q@mail.gmail.com
Fixes: fcf6b1b729 ("ext4: refactor ext4_move_extents code base")
CC:  <stable@vger.kernel.org> # 3.18
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18 10:48:48 -05:00
Linus Torvalds
488926926a misc cleanups (the part that hadn't been picked by individual fs trees)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZZ/BCAAKCRBZ7Krx/gZQ
 68qqAQD6LtfYLDJGdJM+lNpyiG4BA7coYpPlJtmH7mzL+MbFPgEAnM7XsK6zyvza
 3+rEggLM0UFWjg9Ln7Nlq035TeYtFwo=
 =w1mD
 -----END PGP SIGNATURE-----

Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull misc filesystem updates from Al Viro:
 "Misc cleanups (the part that hadn't been picked by individual fs
  trees)"

* tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  apparmorfs: don't duplicate kfree_link()
  orangefs: saner arguments passing in readdir guts
  ocfs2_find_match(): there's no such thing as NULL or negative ->d_parent
  reiserfs_add_entry(): get rid of pointless namelen checks
  __ocfs2_add_entry(), ocfs2_prepare_dir_for_insert(): namelen checks
  ext4_add_entry(): ->d_name.len is never 0
  befs: d_obtain_alias(ERR_PTR(...)) will do the right thing
  affs: d_obtain_alias(ERR_PTR(...)) will do the right thing
  /proc/sys: use d_splice_alias() calling conventions to simplify failure exits
  hostfs: use d_splice_alias() calling conventions to simplify failure exits
  udf_fiiter_add_entry(): check for zero ->d_name.len is bogus...
  udf: d_obtain_alias(ERR_PTR(...)) will do the right thing...
  udf: d_splice_alias() will do the right thing on ERR_PTR() inode
  nfsd: kill stale comment about simple_fill_super() requirements
  bfs_add_entry(): get rid of pointless ->d_name.len checks
  nilfs2: d_obtain_alias(ERR_PTR(...)) will do the right thing...
  zonefs: d_splice_alias() will do the right thing on ERR_PTR() inode
2024-01-11 20:23:50 -08:00
Linus Torvalds
bf4e7080ae fix directory locking scheme on rename
broken in 6.5; we really can't lock two unrelated directories
 without holding ->s_vfs_rename_mutex first and in case of
 same-parent rename of a subdirectory 6.5 ends up doing just
 that.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZZ+lyQAKCRBZ7Krx/gZQ
 60MWAP94hTqeMIpjhsUIkrTnylrIFaiw4UCWFJzIRG1QQYKqCgD/XUaWI9np7dL6
 0wR/j4CQSdJjiEFKUFE2pD3QoSuJYAQ=
 =+x0+
 -----END PGP SIGNATURE-----

Merge tag 'pull-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull rename updates from Al Viro:
 "Fix directory locking scheme on rename

  This was broken in 6.5; we really can't lock two unrelated directories
  without holding ->s_vfs_rename_mutex first and in case of same-parent
  rename of a subdirectory 6.5 ends up doing just that"

* tag 'pull-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  rename(): avoid a deadlock in the case of parents having no common ancestor
  kill lock_two_inodes()
  rename(): fix the locking of subdirectories
  f2fs: Avoid reading renamed directory if parent does not change
  ext4: don't access the source subdirectory content on same-directory rename
  ext2: Avoid reading renamed directory if parent does not change
  udf_rename(): only access the child content on cross-directory rename
  ocfs2: Avoid touching renamed directory if parent does not change
  reiserfs: Avoid touching renamed directory if parent does not change
2024-01-11 20:00:22 -08:00
Linus Torvalds
0d19d9e146 Various ext4 bug fixes and cleanups for v6.8-rc1. The fixes are
mostly in the fstrim and mballoc code paths.  Also enable
 dioread_nolock in the case where the block size is less than the page
 size.  (Dioread_nolock has been default in the bs == ps case for quite
 some time.)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmWe6MMACgkQ8vlZVpUN
 gaM/gAf/e9j4yCAR/W23cICNh/9hw2U0HItEONZF7GDfySlGADL5dsOADe58jLY9
 g8UwBpHptOcyxmMTYgdKPQ2YpUF+3Kd4oi2M1Q6CjeeBeRbwuzT4lMTeKrtMEgiz
 Ns8mqBgGX3DIXjcbkdO9QdLZPBj07djamAIQlWVLHAR2w6LPgiBhHebUSe+36Ufk
 xLaj5X2nkdTtPcN1EnlTYNR+zMLyAwXUsxKf44aUveRwiNAfLGBgY9yvFby7hC+6
 ENCP1WsalvVnaI8mr9pgt1KTXIrElknA1bbiWJ9RZ5Y8Za+MEHxXBKpP/AStX8Nc
 WEo7a9tNB1AXU04+/SgVp9GAkXEViA==
 =Zk8h
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Various ext4 bug fixes and cleanups. The fixes are mostly in the
  fstrim and mballoc code paths.

  Also enable dioread_nolock in the case where the block size is less
  than the page size (dioread_nolock has been default in the bs == ps
  case for quite some time)"

* tag 'ext4_for_linus-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix inconsistent between segment fstrim and full fstrim
  ext4: fallback to complex scan if aligned scan doesn't work
  ext4: convert ext4_da_do_write_end() to take a folio
  ext4: allow for the last group to be marked as trimmed
  ext4: move ext4_check_bdev_write_error() into nojournal mode
  jbd2: abort journal when detecting metadata writeback error of fs dev
  jbd2: remove unused 'JBD2_CHECKPOINT_IO_ERROR' and 'j_atomic_flags'
  jbd2: replace journal state flag by checking errseq
  jbd2: add errseq to detect client fs's bdev writeback error
  ext4: improving calculation of 'fe_{len|start}' in mb_find_extent()
  ext4: clarify handling of unwritten bh in __ext4_block_zero_page_range()
  ext4: treat end of range as exclusive in ext4_zero_range()
  ext4: enable dioread_nolock as default for bs < ps case
  ext4: delete redundant calculations in ext4_mb_get_buddy_page_lock()
  ext4: reduce unnecessary memory allocation in alloc_flex_gd()
  ext4: avoid online resizing failures due to oversized flex bg
  ext4: remove unnecessary check from alloc_flex_gd()
  ext4: unify the type of flexbg_size to unsigned int
2024-01-10 16:09:14 -08:00
Ye Bin
68da4c44b9 ext4: fix inconsistent between segment fstrim and full fstrim
Suppose we issue two FITRIM ioctls for ranges [0,15] and [16,31] with
mininum length of trimmed range set to 8 blocks. If we have say a range of
blocks 10-22 free, this range will not be trimmed because it straddles the
boundary of the two FITRIM ranges and neither part is big enough. This is a
bit surprising to some users that call FITRIM on smaller ranges of blocks
to limit impact on the system. Also XFS trims all free space extents that
overlap with the specified range so we are inconsistent among filesystems.
Let's change ext4_try_to_trim_range() to consider for trimming the whole
free space extent that straddles the end of specified range, not just the
part of it within the range.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231216010919.1995851-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-10 13:53:36 -05:00
Ojaswin Mujoo
1f6bc02f18 ext4: fallback to complex scan if aligned scan doesn't work
Currently in case the goal length is a multiple of stripe size we use
ext4_mb_scan_aligned() to find the stripe size aligned physical blocks.
In case we are not able to find any, we again go back to calling
ext4_mb_choose_next_group() to search for a different suitable block
group. However, since the linear search always begins from the start,
most of the times we end up with the same BG and the cycle continues.

With large fliesystems, the CPU can be stuck in this loop for hours
which can slow down the whole system. Hence, until we figure out a
better way to continue the search (rather than starting from beginning)
in ext4_mb_choose_next_group(), lets just fallback to
ext4_mb_complex_scan_group() in case aligned scan fails, as it is much
more likely to find the needed blocks.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/ee033f6dfa0a7f2934437008a909c3788233950f.1702455010.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-10 13:53:36 -05:00
Matthew Wilcox (Oracle)
4d5cdd757d ext4: convert ext4_da_do_write_end() to take a folio
There's nothing page-specific happening in ext4_da_do_write_end();
it's merely used for its refcount & lock, both of which are folio
properties.  Saves four calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231214053035.1018876-1-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-10 13:53:36 -05:00
Suraj Jitindar Singh
7c784d6248 ext4: allow for the last group to be marked as trimmed
The ext4 filesystem tracks the trim status of blocks at the group
level.  When an entire group has been trimmed then it is marked as
such and subsequent trim invocations with the same minimum trim size
will not be attempted on that group unless it is marked as able to be
trimmed again such as when a block is freed.

Currently the last group can't be marked as trimmed due to incorrect
logic in ext4_last_grp_cluster(). ext4_last_grp_cluster() is supposed
to return the zero based index of the last cluster in a group. This is
then used by ext4_try_to_trim_range() to determine if the trim
operation spans the entire group and as such if the trim status of the
group should be recorded.

ext4_last_grp_cluster() takes a 0 based group index, thus the valid
values for grp are 0..(ext4_get_groups_count - 1). Any group index
less than (ext4_get_groups_count - 1) is not the last group and must
have EXT4_CLUSTERS_PER_GROUP(sb) clusters. For the last group we need
to calculate the number of clusters based on the number of blocks in
the group. Finally subtract 1 from the number of clusters as zero
based indexing is expected.  Rearrange the function slightly to make
it clear what we are calculating and returning.

Reproducer:
// Create file system where the last group has fewer blocks than
// blocks per group
$ mkfs.ext4 -b 4096 -g 8192 /dev/nvme0n1 8191
$ mount /dev/nvme0n1 /mnt

Before Patch:
$ fstrim -v /mnt
/mnt: 25.9 MiB (27156480 bytes) trimmed
// Group not marked as trimmed so second invocation still discards blocks
$ fstrim -v /mnt
/mnt: 25.9 MiB (27156480 bytes) trimmed

After Patch:
fstrim -v /mnt
/mnt: 25.9 MiB (27156480 bytes) trimmed
// Group marked as trimmed so second invocation DOESN'T discard any blocks
fstrim -v /mnt
/mnt: 0 B (0 bytes) trimmed

Fixes: 45e4ab320c ("ext4: move setting of trimmed bit into ext4_try_to_trim_range()")
Cc:  <stable@vger.kernel.org> # 4.19+
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231213051635.37731-1-surajjs@amazon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-10 13:53:17 -05:00
Linus Torvalds
fb46e22a9e Many singleton patches against the MM code. The patch series which
are included in this merge do the following:
 
 - Peng Zhang has done some mapletree maintainance work in the
   series
 
 	"maple_tree: add mt_free_one() and mt_attr() helpers"
 	"Some cleanups of maple tree"
 
 - In the series "mm: use memmap_on_memory semantics for dax/kmem"
   Vishal Verma has altered the interworking between memory-hotplug
   and dax/kmem so that newly added 'device memory' can more easily
   have its memmap placed within that newly added memory.
 
 - Matthew Wilcox continues folio-related work (including a few
   fixes) in the patch series
 
 	"Add folio_zero_tail() and folio_fill_tail()"
 	"Make folio_start_writeback return void"
 	"Fix fault handler's handling of poisoned tail pages"
 	"Convert aops->error_remove_page to ->error_remove_folio"
 	"Finish two folio conversions"
 	"More swap folio conversions"
 
 - Kefeng Wang has also contributed folio-related work in the series
 
 	"mm: cleanup and use more folio in page fault"
 
 - Jim Cromie has improved the kmemleak reporting output in the
   series "tweak kmemleak report format".
 
 - In the series "stackdepot: allow evicting stack traces" Andrey
   Konovalov to permits clients (in this case KASAN) to cause
   eviction of no longer needed stack traces.
 
 - Charan Teja Kalla has fixed some accounting issues in the page
   allocator's atomic reserve calculations in the series "mm:
   page_alloc: fixes for high atomic reserve caluculations".
 
 - Dmitry Rokosov has added to the samples/ dorectory some sample
   code for a userspace memcg event listener application.  See the
   series "samples: introduce cgroup events listeners".
 
 - Some mapletree maintanance work from Liam Howlett in the series
   "maple_tree: iterator state changes".
 
 - Nhat Pham has improved zswap's approach to writeback in the
   series "workload-specific and memory pressure-driven zswap
   writeback".
 
 - DAMON/DAMOS feature and maintenance work from SeongJae Park in
   the series
 
 	"mm/damon: let users feed and tame/auto-tune DAMOS"
 	"selftests/damon: add Python-written DAMON functionality tests"
 	"mm/damon: misc updates for 6.8"
 
 - Yosry Ahmed has improved memcg's stats flushing in the series
   "mm: memcg: subtree stats flushing and thresholds".
 
 - In the series "Multi-size THP for anonymous memory" Ryan Roberts
   has added a runtime opt-in feature to transparent hugepages which
   improves performance by allocating larger chunks of memory during
   anonymous page faults.
 
 - Matthew Wilcox has also contributed some cleanup and maintenance
   work against eh buffer_head code int he series "More buffer_head
   cleanups".
 
 - Suren Baghdasaryan has done work on Andrea Arcangeli's series
   "userfaultfd move option".  UFFDIO_MOVE permits userspace heap
   compaction algorithms to move userspace's pages around rather than
   UFFDIO_COPY'a alloc/copy/free.
 
 - Stefan Roesch has developed a "KSM Advisor", in the series
   "mm/ksm: Add ksm advisor".  This is a governor which tunes KSM's
   scanning aggressiveness in response to userspace's current needs.
 
 - Chengming Zhou has optimized zswap's temporary working memory
   use in the series "mm/zswap: dstmem reuse optimizations and
   cleanups".
 
 - Matthew Wilcox has performed some maintenance work on the
   writeback code, both code and within filesystems.  The series is
   "Clean up the writeback paths".
 
 - Andrey Konovalov has optimized KASAN's handling of alloc and
   free stack traces for secondary-level allocators, in the series
   "kasan: save mempool stack traces".
 
 - Andrey also performed some KASAN maintenance work in the series
   "kasan: assorted clean-ups".
 
 - David Hildenbrand has gone to town on the rmap code.  Cleanups,
   more pte batching, folio conversions and more.  See the series
   "mm/rmap: interface overhaul".
 
 - Kinsey Ho has contributed some maintenance work on the MGLRU
   code in the series "mm/mglru: Kconfig cleanup".
 
 - Matthew Wilcox has contributed lruvec page accounting code
   cleanups in the series "Remove some lruvec page accounting
   functions".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA
 jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27
 Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU=
 =0NHs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Peng Zhang has done some mapletree maintainance work in the series

	'maple_tree: add mt_free_one() and mt_attr() helpers'
	'Some cleanups of maple tree'

   - In the series 'mm: use memmap_on_memory semantics for dax/kmem'
     Vishal Verma has altered the interworking between memory-hotplug
     and dax/kmem so that newly added 'device memory' can more easily
     have its memmap placed within that newly added memory.

   - Matthew Wilcox continues folio-related work (including a few fixes)
     in the patch series

	'Add folio_zero_tail() and folio_fill_tail()'
	'Make folio_start_writeback return void'
	'Fix fault handler's handling of poisoned tail pages'
	'Convert aops->error_remove_page to ->error_remove_folio'
	'Finish two folio conversions'
	'More swap folio conversions'

   - Kefeng Wang has also contributed folio-related work in the series

	'mm: cleanup and use more folio in page fault'

   - Jim Cromie has improved the kmemleak reporting output in the series
     'tweak kmemleak report format'.

   - In the series 'stackdepot: allow evicting stack traces' Andrey
     Konovalov to permits clients (in this case KASAN) to cause eviction
     of no longer needed stack traces.

   - Charan Teja Kalla has fixed some accounting issues in the page
     allocator's atomic reserve calculations in the series 'mm:
     page_alloc: fixes for high atomic reserve caluculations'.

   - Dmitry Rokosov has added to the samples/ dorectory some sample code
     for a userspace memcg event listener application. See the series
     'samples: introduce cgroup events listeners'.

   - Some mapletree maintanance work from Liam Howlett in the series
     'maple_tree: iterator state changes'.

   - Nhat Pham has improved zswap's approach to writeback in the series
     'workload-specific and memory pressure-driven zswap writeback'.

   - DAMON/DAMOS feature and maintenance work from SeongJae Park in the
     series

	'mm/damon: let users feed and tame/auto-tune DAMOS'
	'selftests/damon: add Python-written DAMON functionality tests'
	'mm/damon: misc updates for 6.8'

   - Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
     memcg: subtree stats flushing and thresholds'.

   - In the series 'Multi-size THP for anonymous memory' Ryan Roberts
     has added a runtime opt-in feature to transparent hugepages which
     improves performance by allocating larger chunks of memory during
     anonymous page faults.

   - Matthew Wilcox has also contributed some cleanup and maintenance
     work against eh buffer_head code int he series 'More buffer_head
     cleanups'.

   - Suren Baghdasaryan has done work on Andrea Arcangeli's series
     'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
     compaction algorithms to move userspace's pages around rather than
     UFFDIO_COPY'a alloc/copy/free.

   - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
     Add ksm advisor'. This is a governor which tunes KSM's scanning
     aggressiveness in response to userspace's current needs.

   - Chengming Zhou has optimized zswap's temporary working memory use
     in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.

   - Matthew Wilcox has performed some maintenance work on the writeback
     code, both code and within filesystems. The series is 'Clean up the
     writeback paths'.

   - Andrey Konovalov has optimized KASAN's handling of alloc and free
     stack traces for secondary-level allocators, in the series 'kasan:
     save mempool stack traces'.

   - Andrey also performed some KASAN maintenance work in the series
     'kasan: assorted clean-ups'.

   - David Hildenbrand has gone to town on the rmap code. Cleanups, more
     pte batching, folio conversions and more. See the series 'mm/rmap:
     interface overhaul'.

   - Kinsey Ho has contributed some maintenance work on the MGLRU code
     in the series 'mm/mglru: Kconfig cleanup'.

   - Matthew Wilcox has contributed lruvec page accounting code cleanups
     in the series 'Remove some lruvec page accounting functions'"

* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
  mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
  mm, treewide: introduce NR_PAGE_ORDERS
  selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
  selftests/mm: skip test if application doesn't has root privileges
  selftests/mm: conform test to TAP format output
  selftests: mm: hugepage-mmap: conform to TAP format output
  selftests/mm: gup_test: conform test to TAP format output
  mm/selftests: hugepage-mremap: conform test to TAP format output
  mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
  mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
  mm/memcontrol: remove __mod_lruvec_page_state()
  mm/khugepaged: use a folio more in collapse_file()
  slub: use a folio in __kmalloc_large_node
  slub: use folio APIs in free_large_kmalloc()
  slub: use alloc_pages_node() in alloc_slab_page()
  mm: remove inc/dec lruvec page state functions
  mm: ratelimit stat flush from workingset shrinker
  kasan: stop leaking stack trace handles
  mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
  mm/mglru: add dummy pmd_dirty()
  ...
2024-01-09 11:18:47 -08:00
Linus Torvalds
3f6984e730 vfs-6.8.super
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUx4wAKCRCRxhvAZXjc
 osaNAQC/c+xXVfiq/pFbuK9MQLna4RGZaGcG9k312YniXbHq0AD9HAf4aPcZwPy1
 /wkD4pauj3UZ3f0xBSyazGBvAXyN0Qc=
 =iFAQ
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs super updates from Christian Brauner:
 "This contains the super work for this cycle including the long-awaited
  series by Jan to make it possible to prevent writing to mounted block
  devices:

   - Writing to mounted devices is dangerous and can lead to filesystem
     corruption as well as crashes. Furthermore syzbot comes with more
     and more involved examples how to corrupt block device under a
     mounted filesystem leading to kernel crashes and reports we can do
     nothing about. Add tracking of writers to each block device and a
     kernel cmdline argument which controls whether other writeable
     opens to block devices open with BLK_OPEN_RESTRICT_WRITES flag are
     allowed.

     Note that this effectively only prevents modification of the
     particular block device's page cache by other writers. The actual
     device content can still be modified by other means - e.g. by
     issuing direct scsi commands, by doing writes through devices lower
     in the storage stack (e.g. in case loop devices, DM, or MD are
     involved) etc. But blocking direct modifications of the block
     device page cache is enough to give filesystems a chance to perform
     data validation when loading data from the underlying storage and
     thus prevent kernel crashes.

     Syzbot can use this cmdline argument option to avoid uninteresting
     crashes. Also users whose userspace setup does not need writing to
     mounted block devices can set this option for hardening. We expect
     that this will be interesting to quite a few workloads.

     Btrfs is currently opted out of this because they still haven't
     merged patches we require for this to work from three kernel
     releases ago.

   - Reimplement block device freezing and thawing as holder operations
     on the block device.

     This allows us to extend block device freezing to all devices
     associated with a superblock and not just the main device. It also
     allows us to remove get_active_super() and thus another function
     that scans the global list of superblocks.

     Freezing via additional block devices only works if the filesystem
     chooses to use @fs_holder_ops for these additional devices as well.
     That currently only includes ext4 and xfs.

     Earlier releases switched get_tree_bdev() and mount_bdev() to use
     @fs_holder_ops. The remaining nilfs2 open-coded version of
     mount_bdev() has been converted to rely on @fs_holder_ops as well.
     So block device freezing for the main block device will continue to
     work as before.

     There should be no regressions in functionality. The only special
     case is btrfs where block device freezing for the main block device
     never worked because sb->s_bdev isn't set. Block device freezing
     for btrfs can be fixed once they can switch to @fs_holder_ops but
     that can happen whenever they're ready"

* tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
  block: Fix a memory leak in bdev_open_by_dev()
  super: don't bother with WARN_ON_ONCE()
  super: massage wait event mechanism
  ext4: Block writes to journal device
  xfs: Block writes to log device
  fs: Block writes to mounted block devices
  btrfs: Do not restrict writes to btrfs devices
  block: Add config option to not allow writing to mounted devices
  block: Remove blkdev_get_by_*() functions
  bcachefs: Convert to bdev_open_by_path()
  fs: handle freezing from multiple devices
  fs: remove dead check
  nilfs2: simplify device handling
  fs: streamline thaw_super_locked
  ext4: simplify device handling
  xfs: simplify device handling
  fs: simplify setup_bdev_super() calls
  blkdev: comment fs_holder_ops
  porting: document block device freeze and thaw changes
  fs: remove unused helper
  ...
2024-01-08 10:43:51 -08:00
Linus Torvalds
c604110e66 vfs-6.8.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUxRQAKCRCRxhvAZXjc
 ov/QAQDzvge3oQ9MEymmOiyzzcF+HhAXBr+9oEsYJjFc1p0TsgEA61gXjZo7F1jY
 KBqd6znOZCR+Waj0kIVJRAo/ISRBqQc=
 =0bRl
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual fses.

  Features:

   - Add Jan Kara as VFS reviewer

   - Show correct device and inode numbers in proc/<pid>/maps for vma
     files on stacked filesystems. This is now easily doable thanks to
     the backing file work from the last cycles. This comes with
     selftests

  Cleanups:

   - Remove a redundant might_sleep() from wait_on_inode()

   - Initialize pointer with NULL, not 0

   - Clarify comment on access_override_creds()

   - Rework and simplify eventfd_signal() and eventfd_signal_mask()
     helpers

   - Process aio completions in batches to avoid needless wakeups

   - Completely decouple struct mnt_idmap from namespaces. We now only
     keep the actual idmapping around and don't stash references to
     namespaces

   - Reformat maintainer entries to indicate that a given subsystem
     belongs to fs/

   - Simplify fput() for files that were never opened

   - Get rid of various pointless file helpers

   - Rename various file helpers

   - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from
     last cycle

   - Make relatime_need_update() return bool

   - Use GFP_KERNEL instead of GFP_USER when allocating superblocks

   - Replace deprecated ida_simple_*() calls with their current ida_*()
     counterparts

  Fixes:

   - Fix comments on user namespace id mapping helpers. They aren't
     kernel doc comments so they shouldn't be using /**

   - s/Retuns/Returns/g in various places

   - Add missing parameter documentation on can_move_mount_beneath()

   - Rename i_mapping->private_data to i_mapping->i_private_data

   - Fix a false-positive lockdep warning in pipe_write() for watch
     queues

   - Improve __fget_files_rcu() code generation to improve performance

   - Only notify writer that pipe resizing has finished after setting
     pipe->max_usage otherwise writers are never notified that the pipe
     has been resized and hang

   - Fix some kernel docs in hfsplus

   - s/passs/pass/g in various places

   - Fix kernel docs in ntfs

   - Fix kcalloc() arguments order reported by gcc 14

   - Fix uninitialized value in reiserfs"

* tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
  reiserfs: fix uninit-value in comp_keys
  watch_queue: fix kcalloc() arguments order
  ntfs: dir.c: fix kernel-doc function parameter warnings
  fs: fix doc comment typo fs tree wide
  selftests/overlayfs: verify device and inode numbers in /proc/pid/maps
  fs/proc: show correct device and inode numbers in /proc/pid/maps
  eventfd: Remove usage of the deprecated ida_simple_xx() API
  fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation
  fs/hfsplus: wrapper.c: fix kernel-doc warnings
  fs: add Jan Kara as reviewer
  fs/inode: Make relatime_need_update return bool
  pipe: wakeup wr_wait after setting max_usage
  file: remove __receive_fd()
  file: stop exposing receive_fd_user()
  fs: replace f_rcuhead with f_task_work
  file: remove pointless wrapper
  file: s/close_fd_get_file()/file_close_fd()/g
  Improve __fget_files_rcu() code generation (and thus __fget_light())
  file: massage cleanup of files that failed to open
  fs/pipe: Fix lockdep false-positive in watchqueue pipe_write()
  ...
2024-01-08 10:26:08 -08:00
Zhihao Cheng
ada3fb86a3 ext4: move ext4_check_bdev_write_error() into nojournal mode
Since JBD2 takes care of all metadata writeback errors of fs dev,
ext4_check_bdev_write_error() is useful only in nojournal mode.
Move it into '!ext4_handle_valid(handle)' branch.

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231213013224.2100050-6-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-04 23:42:21 -05:00