IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
commit b51111271b0352aa596c5ae8faf06939e91b3b68 upstream.
For a filesystem which has btrfs read-only property set to true, all
write operations including xattr should be denied. However, security
xattr can still be changed even if btrfs ro property is true.
This happens because xattr_permission() does not have any restrictions
on security.*, system.* and in some cases trusted.* from VFS and
the decision is left to the underlying filesystem. See comments in
xattr_permission() for more details.
This patch checks if the root is read-only before performing the set
xattr operation.
Testcase:
DEV=/dev/vdb
MNT=/mnt
mkfs.btrfs -f $DEV
mount $DEV $MNT
echo "file one" > $MNT/f1
setfattr -n "security.one" -v 2 $MNT/f1
btrfs property set /mnt ro true
setfattr -n "security.one" -v 1 $MNT/f1
umount $MNT
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f2c3bec215694fb8bc0ef5010f2a758d1906fc2d upstream.
If the replace target device reappears after the suspended replace is
cancelled, it blocks the mount operation as it can't find the matching
replace-item in the metadata. As shown below,
BTRFS error (device sda5): replace devid present without an active replace item
To overcome this situation, the user can run the command
btrfs device scan --forget <replace target device>
and try the mount command again. And also, to avoid repeating the issue,
superblock on the devid=0 must be wiped.
wipefs -a device-path-to-devid=0.
This patch adds some info when this situation occurs.
Reported-by: Samuel Greiner <samuel@balkonien.org>
Link: https://lore.kernel.org/linux-btrfs/b4f62b10-b295-26ea-71f9-9a5c9299d42c@balkonien.org/T/
CC: stable@vger.kernel.org # 5.0+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 59a3991984dbc1fc47e5651a265c5200bd85464e upstream.
If the filesystem mounts with the replace-operation in a suspended state
and try to cancel the suspended replace-operation, we hit the assert. The
assert came from the commit fe97e2e173af ("btrfs: dev-replace: replace's
scrub must not be running in suspended state") that was actually not
required. So just remove it.
$ mount /dev/sda5 /btrfs
BTRFS info (device sda5): cannot continue dev_replace, tgtdev is missing
BTRFS info (device sda5): you may cancel the operation after 'mount -o degraded'
$ mount -o degraded /dev/sda5 /btrfs <-- success.
$ btrfs replace cancel /btrfs
kernel: assertion failed: ret != -ENOTCONN, in fs/btrfs/dev-replace.c:1131
kernel: ------------[ cut here ]------------
kernel: kernel BUG at fs/btrfs/ctree.h:3750!
After the patch:
$ btrfs replace cancel /btrfs
BTRFS info (device sda5): suspended dev_replace from /dev/sda5 (devid 1) to <missing disk> canceled
Fixes: fe97e2e173af ("btrfs: dev-replace: replace's scrub must not be running in suspended state")
CC: stable@vger.kernel.org # 5.0+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 47bf225a8d2cccb15f7e8d4a1ed9b757dd86afd7 upstream.
At btrfs_del_root_ref(), if btrfs_search_slot() returns an error, we end
up returning from the function with a value of 0 (success). This happens
because the function returns the value stored in the variable 'err',
which is 0, while the error value we got from btrfs_search_slot() is
stored in the 'ret' variable.
So fix it by setting 'err' with the error value.
Fixes: 8289ed9f93bef2 ("btrfs: replace the BUG_ON in btrfs_del_root_ref with proper error handling")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f6065f8edeb25f4a9dfe0b446030ad995a84a088 upstream.
[BUG]
There is a small workload which will always fail with recent kernel:
(A simplified version from btrfs/125 test case)
mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
mount $dev1 $mnt
xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
sync
umount $mnt
btrfs dev scan -u $dev3
mount -o degraded $dev1 $mnt
xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
umount $mnt
btrfs dev scan
mount $dev1 $mnt
btrfs balance start --full-balance $mnt
umount $mnt
The failure is always failed to read some tree blocks:
BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
...
[CAUSE]
With the recently added debug output, we can see all RAID56 operations
related to full stripe 38928384:
56.1183: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
56.1187: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
56.1188: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
56.1188: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
56.1189: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
56.1218: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
56.1219: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
56.2721: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
56.2723: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
56.2724: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
Before we enter raid56_parity_recover(), we have triggered some metadata
write for the full stripe 38928384, this leads to us to read all the
sectors from disk.
Furthermore, btrfs raid56 write will cache its calculated P/Q sectors to
avoid unnecessary read.
This means, for that full stripe, after any partial write, we will have
stale data, along with P/Q calculated using that stale data.
Thankfully due to patch "btrfs: only write the sectors in the vertical stripe
which has data stripes" we haven't submitted all the corrupted P/Q to disk.
When we really need to recover certain range, aka in
raid56_parity_recover(), we will use the cached rbio, along with its
cached sectors (the full stripe is all cached).
This explains why we have no event raid56_scrub_read_recover()
triggered.
Since we have the cached P/Q which is calculated using the stale data,
the recovered one will just be stale.
In our particular test case, it will always return the same incorrect
metadata, thus causing the same error message "parent transid verify
failed on 39010304 wanted 9 found 7" again and again.
[BTRFS DESTRUCTIVE RMW PROBLEM]
Test case btrfs/125 (and above workload) always has its trouble with
the destructive read-modify-write (RMW) cycle:
0 32K 64K
Data1: | Good | Good |
Data2: | Bad | Bad |
Parity: | Good | Good |
In above case, if we trigger any write into Data1, we will use the bad
data in Data2 to re-generate parity, killing the only chance to recovery
Data2, thus Data2 is lost forever.
This destructive RMW cycle is not specific to btrfs RAID56, but there
are some btrfs specific behaviors making the case even worse:
- Btrfs will cache sectors for unrelated vertical stripes.
In above example, if we're only writing into 0~32K range, btrfs will
still read data range (32K ~ 64K) of Data1, and (64K~128K) of Data2.
This behavior is to cache sectors for later update.
Incidentally commit d4e28d9b5f04 ("btrfs: raid56: make steal_rbio()
subpage compatible") has a bug which makes RAID56 to never trust the
cached sectors, thus slightly improve the situation for recovery.
Unfortunately, follow up fix "btrfs: update stripe_sectors::uptodate in
steal_rbio" will revert the behavior back to the old one.
- Btrfs raid56 partial write will update all P/Q sectors and cache them
This means, even if data at (64K ~ 96K) of Data2 is free space, and
only (96K ~ 128K) of Data2 is really stale data.
And we write into that (96K ~ 128K), we will update all the parity
sectors for the full stripe.
This unnecessary behavior will completely kill the chance of recovery.
Thankfully, an unrelated optimization "btrfs: only write the sectors
in the vertical stripe which has data stripes" will prevent
submitting the write bio for untouched vertical sectors.
That optimization will keep the on-disk P/Q untouched for a chance for
later recovery.
[FIX]
Although we have no good way to completely fix the destructive RMW
(unless we go full scrub for each partial write), we can still limit the
damage.
With patch "btrfs: only write the sectors in the vertical stripe which
has data stripes" now we won't really submit the P/Q of unrelated
vertical stripes, so the on-disk P/Q should still be fine.
Now we really need to do is just drop all the cached sectors when doing
recovery.
By this, we have a chance to read the original P/Q from disk, and have a
chance to recover the stale data, while still keep the cache to speed up
regular write path.
In fact, just dropping all the cache for recovery path is good enough to
allow the test case btrfs/125 along with the small script to pass
reliably.
The lack of metadata write after the degraded mount, and forced metadata
COW is saving us this time.
So this patch will fix the behavior by not trust any cache in
__raid56_parity_recover(), to solve the problem while still keep the
cache useful.
But please note that this test pass DOES NOT mean we have solved the
destructive RMW problem, we just do better damage control a little
better.
Related patches:
- btrfs: only write the sectors in the vertical stripe
- d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage compatible")
- btrfs: update stripe_sectors::uptodate in steal_rbio
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bd8f7e627703ca5707833d623efcd43f104c7b3f upstream.
If we have only 8K partial write at the beginning of a full RAID56
stripe, we will write the following contents:
0 8K 32K 64K
Disk 1 (data): |XX| | |
Disk 2 (data): | | |
Disk 3 (parity): |XXXXXXXXXXXXXXX|XXXXXXXXXXXXXXX|
|X| means the sector will be written back to disk.
Note that, although we won't write any sectors from disk 2, but we will
write the full 64KiB of parity to disk.
This behavior is fine for now, but not for the future (especially for
RAID56J, as we waste quite some space to journal the unused parity
stripes).
So here we will also utilize the btrfs_raid_bio::dbitmap, anytime we
queue a higher level bio into an rbio, we will update rbio::dbitmap to
indicate which vertical stripes we need to writeback.
And at finish_rmw(), we also check dbitmap to see if we need to write
any sector in the vertical stripe.
So after the patch, above example will only lead to the following
writeback pattern:
0 8K 32K 64K
Disk 1 (data): |XX| | |
Disk 2 (data): | | |
Disk 3 (parity): |XX| | |
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 5fa2cffba0b82336a2244d941322eb1627ff787b ]
Coverity complains about assigning a pointer based on
value length before checking that value length goes
beyond the end of the SMB. Although this is even more
unlikely as value length is a single byte, and the
pointer is not dereferenced until laterm, it is clearer
to check the lengths first.
Addresses-Coverity: 1467704 ("Speculative execution data leak")
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 141170b759e03958f296033bb7001be62d1d363b ]
As Dipanjan Das <mail.dipanjan.das@gmail.com> reported, syzkaller
found a f2fs bug as below:
RIP: 0010:f2fs_new_node_page+0x19ac/0x1fc0 fs/f2fs/node.c:1295
Call Trace:
write_all_xattrs fs/f2fs/xattr.c:487 [inline]
__f2fs_setxattr+0xe76/0x2e10 fs/f2fs/xattr.c:743
f2fs_setxattr+0x233/0xab0 fs/f2fs/xattr.c:790
f2fs_xattr_generic_set+0x133/0x170 fs/f2fs/xattr.c:86
__vfs_setxattr+0x115/0x180 fs/xattr.c:182
__vfs_setxattr_noperm+0x125/0x5f0 fs/xattr.c:216
__vfs_setxattr_locked+0x1cf/0x260 fs/xattr.c:277
vfs_setxattr+0x13f/0x330 fs/xattr.c:303
setxattr+0x146/0x160 fs/xattr.c:611
path_setxattr+0x1a7/0x1d0 fs/xattr.c:630
__do_sys_lsetxattr fs/xattr.c:653 [inline]
__se_sys_lsetxattr fs/xattr.c:649 [inline]
__x64_sys_lsetxattr+0xbd/0x150 fs/xattr.c:649
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
NAT entry and nat bitmap can be inconsistent, e.g. one nid is free
in nat bitmap, and blkaddr in its NAT entry is not NULL_ADDR, it
may trigger BUG_ON() in f2fs_new_node_page(), fix it.
Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 69cb8e9d8cd97cdf5e293b26d70a9dee3e35e6bd ]
This patch avoids an attempt to resize the filesystem to an
unaligned cluster boundary. An online resize to a size that is not
integral to cluster size results in the last iteration attempting to
grow the fs by a negative amount, which trips a BUG_ON and leaves the fs
with a corrupted in-memory superblock.
Signed-off-by: Oleg Kiselev <okiselev@amazon.com>
Link: https://lore.kernel.org/r/0E92A0AB-4F16-4F1A-94B7-702CC6504FDE@amazon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b24e77ef1c6d4dbf42749ad4903c97539cc9755a ]
Now if check directoy entry is corrupted, ext4_empty_dir may return true
then directory will be removed when file system mounted with "errors=continue".
In order not to make things worse just return false when directory is corrupted.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220622090223.682234-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 2135e5d56278ffdb1c2e6d325dc6b87f669b9dac upstream.
If someone cancels the open RPC call, then we must not try to free
either the open slot or the layoutget operation arguments, since they
are likely still in use by the hung RPC call.
Fixes: 6949493884fe ("NFSv4: Don't hold the layoutget locks across multiple RPC calls")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e35a5e782f67ed76a65ad0f23a484444a95f000f upstream.
A client should be able to handle getting an EACCES error while doing
a mount operation to reclaim state due to NFS4CLNT_RECLAIM_REBOOT
being set. If the server returns RPC_AUTH_BADCRED because authentication
failed when we execute "exportfs -au", then RECLAIM_COMPLETE will go a
wrong way. After mount succeeds, all OPEN call will fail due to an
NFS4ERR_GRACE error being returned. This patch is to fix it by resending
a RPC request.
Signed-off-by: Zhang Xianwei <zhang.xianwei8@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Fixes: aa5190d0ed7d ("NFSv4: Kill nfs4_async_handle_error() abuses by NFSv4.1")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 51fd2eb52c0ca8275a906eed81878ef50ae94eb0 upstream.
nfs_idmap_instantiate() will cause the process that is waiting in
request_key_with_auxdata() to wake up and exit. If there is a second
process waiting for the idmap->idmap_mutex, then it may wake up and
start a new call to request_key_with_auxdata(). If the call to
idmap_pipe_downcall() from the first process has not yet finished
calling nfs_idmap_complete_pipe_upcall_locked(), then we may end up
triggering the WARN_ON_ONCE() in nfs_idmap_prepare_pipe_upcall().
The fix is to ensure that we clear idmap->idmap_upcall_data before
calling nfs_idmap_instantiate().
Fixes: e9ab41b620e4 ("NFSv4: Clean up the legacy idmapper upcall")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7ccafd4b2b9f34e6d8185f796f151c47424e273e upstream.
Don't assume that the NFS4ERR_DELAY means that the server is processing
this slot id.
Fixes: 3453d5708b33 ("NFSv4.1: Avoid false retries when RPC calls are interrupted")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f07a5d2427fc113dc50c5c818eba8929bc27b8ca upstream.
When we're trying to figure out what the server may or may not have seen
in terms of request numbers, do not assume that requests with a larger
number were missed, just because we saw a reply to a request with a
smaller number.
Fixes: 3453d5708b33 ("NFSv4.1: Avoid false retries when RPC calls are interrupted")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7a6b75b79902e47f46328b57733f2604774fa2d9 upstream.
During log replay, when processing inode references, if we get an error
when looking up for an extended reference at __add_inode_ref(), we ignore
it and proceed, returning success (0) if no other error happens after the
lookup. This is obviously wrong because in case an extended reference
exists and it encodes some name not in the log, we need to unlink it,
otherwise the filesystem state will not match the state it had after the
last fsync.
So just make __add_inode_ref() return an error it gets from the extended
reference lookup.
Fixes: f186373fef005c ("btrfs: extended inode refs")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dc4d31684974d140250f3ee612c3f0cab13b3146 upstream.
[BUG]
If we have a btrfs image with dirty log, along with an unsupported RO
compatible flag:
log_root 30474240
...
compat_flags 0x0
compat_ro_flags 0x40000003
( FREE_SPACE_TREE |
FREE_SPACE_TREE_VALID |
unknown flag: 0x40000000 )
Then even if we can only mount it RO, we will still cause metadata
update for log replay:
BTRFS info (device dm-1): flagging fs with big metadata feature
BTRFS info (device dm-1): using free space tree
BTRFS info (device dm-1): has skinny extents
BTRFS info (device dm-1): start tree-log replay
This is definitely against RO compact flag requirement.
[CAUSE]
RO compact flag only forces us to do RO mount, but we will still do log
replay for plain RO mount.
Thus this will result us to do log replay and update metadata.
This can be very problematic for new RO compat flag, for example older
kernel can not understand v2 cache, and if we allow metadata update on
RO mount and invalidate/corrupt v2 cache.
[FIX]
Just reject the mount unless rescue=nologreplay is provided:
BTRFS error (device dm-1): cannot replay dirty log with unsupport optional features (0x40000000), try rescue=nologreplay instead
We don't want to set rescue=nologreply directly, as this would make the
end user to read the old data, and cause confusion.
Since the such case is really rare, we're mostly fine to just reject the
mount with an error message, which also includes the proper workaround.
CC: stable@vger.kernel.org #4.9+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fd7e672ea98b95b9d4c9dae316639f03c16a749d upstream.
Use the EXT4_INODE_HAS_XATTR_SPACE macro to more accurately
determine whether the inode have xattr space.
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220616021358.2504451-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c9fd167d57133c5b748d16913c4eabc55e531c73 upstream.
If the ext4 inode does not have xattr space, 0 is returned in the
get_max_inline_xattr_value_size function. Otherwise, the function returns
a negative value when the inode does not contain EXT4_STATE_XATTR.
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220616021358.2504451-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7f0d8e1d607c1a4fa9a27362a108921d82230874 upstream.
A race can occur in the unlikely event ext4 is unable to allocate a
physical cluster for a delayed allocation in a bigalloc file system
during writeback. Failure to allocate a cluster forces error recovery
that includes a call to mpage_release_unused_pages(). That function
removes any corresponding delayed allocated blocks from the extent
status tree. If a new delayed write is in progress on the same cluster
simultaneously, resulting in the addition of an new extent containing
one or more blocks in that cluster to the extent status tree, delayed
block accounting can be thrown off if that delayed write then encounters
a similar cluster allocation failure during future writeback.
Write lock the i_data_sem in mpage_release_unused_pages() to fix this
problem. Ext4's block/cluster accounting code for bigalloc relies on
i_data_sem for mutual exclusion, as is found in the delayed write path,
and the locking in mpage_release_unused_pages() is missing.
Cc: stable@kernel.org
Reported-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20220615160530.1928801-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit de394a86658ffe4e89e5328fd4993abfe41b7435 upstream.
When doing an online resize, the on-disk superblock on-disk wasn't
updated. This means that when the file system is unmounted and
remounted, and the on-disk overhead value is non-zero, this would
result in the results of statfs(2) to be incorrect.
This was partially fixed by Commits 10b01ee92df5 ("ext4: fix overhead
calculation to account for the reserved gdt blocks"), 85d825dbf489
("ext4: force overhead calculation if the s_overhead_cluster makes no
sense"), and eb7054212eac ("ext4: update the cached overhead value in
the superblock").
However, since it was too expensive to forcibly recalculate the
overhead for bigalloc file systems at every mount, this didn't fix the
problem for bigalloc file systems. This commit should address the
problem when resizing file systems with the bigalloc feature enabled.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20220629040026.112371-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b8a04fe77ef1360fbf73c80fddbdfeaa9407ed1b upstream.
ext4_append() must always allocate a new block, otherwise we run the
risk of overwriting existing directory block corrupting the directory
tree in the process resulting in all manner of problems later on.
Add a sanity check to see if the logical block is already allocated and
error out if it is.
Cc: stable@kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20220704142721.157985-2-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 179b14152dcb6a24c3415200603aebca70ff13af upstream.
When adding an xattr to an inode, we must ensure that the inode_size is
not less than EXT4_GOOD_OLD_INODE_SIZE + extra_isize + pad. Otherwise,
the end position may be greater than the start position, resulting in UAF.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220616021358.2504451-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1314ca78b2c35d3e7d0f097268a2ee6dc0d369ef upstream.
If you try to force a chunk allocation, but you race with another chunk
allocation, you will end up waiting on the chunk allocation that just
occurred and then allocate another chunk. If you have many threads all
doing this at once you can way over-allocate chunks.
Fix this by resetting force to NO_FORCE, that way if we think we need to
allocate we can, otherwise we don't force another chunk allocation if
one is already happening.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c64797809a64c73497082aa05e401a062ec1af34 ]
The commit 15c8e72e88e0 ("fuse: allow skipping control interface and forced
unmount") tries to remove the control interface for virtio-fs since it does
not support aborting requests which are being processed. But it doesn't
work now.
This patch fixes it by skipping creating the control interface if
fuse_conn->no_control is set.
Fixes: 15c8e72e88e0 ("fuse: allow skipping control interface and forced unmount")
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 07ea7a617d6b278fb7acedb5cbe1a81ce2de7d0c ]
When migrating to extents, the checksum seed of temporary inode
need to be replaced by inode's, otherwise the inode checksums
will be incorrect when swapping the inodes data.
However, the temporary inode can not match it's checksum to
itself since it has lost it's own checksum seed.
mkfs.ext4 -F /dev/sdc
mount /dev/sdc /mnt/sdc
xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/sdc/testfile
chattr -e /mnt/sdc/testfile
chattr +e /mnt/sdc/testfile
umount /dev/sdc
fsck -fn /dev/sdc
========
...
Pass 1: Checking inodes, blocks, and sizes
Inode 13 passes checks, but checksum does not match inode. Fix? no
...
========
The fix is simple, save the checksum seed of temporary inode, and
recover it after migrating to extents.
Fixes: e81c9302a6c3 ("ext4: set csum seed in tmp inode while migrating to extents")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220617062515.2113438-1-lilingfeng3@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a89573ce4ad32f19f43ec669771726817e185be0 ]
We catch an assert problem in jbd2_journal_commit_transaction() when
doing fsstress and request falut injection tests. The problem is
happened in a race condition between jbd2_journal_commit_transaction()
and ext4_end_io_end(). Firstly, ext4_writepages() writeback dirty pages
and start reserved handle, and then the journal was aborted due to some
previous metadata IO error, jbd2_journal_abort() start to commit current
running transaction, the committing procedure could be raced by
ext4_end_io_end() and lead to subtract j_reserved_credits twice from
commit_transaction->t_outstanding_credits, finally the
t_outstanding_credits is mistakenly smaller than t_nr_buffers and
trigger assert.
kjournald2 kworker
jbd2_journal_commit_transaction()
write_unlock(&journal->j_state_lock);
atomic_sub(j_reserved_credits, t_outstanding_credits); //sub once
jbd2_journal_start_reserved()
start_this_handle() //detect aborted journal
jbd2_journal_free_reserved() //get running transaction
read_lock(&journal->j_state_lock)
__jbd2_journal_unreserve_handle()
atomic_sub(j_reserved_credits, t_outstanding_credits);
//sub again
read_unlock(&journal->j_state_lock);
journal->j_running_transaction = NULL;
J_ASSERT(t_nr_buffers <= t_outstanding_credits) //bomb!!!
Fix this issue by using journal->j_state_lock to protect the subtraction
in jbd2_journal_commit_transaction().
Fixes: 96f1e0974575 ("jbd2: avoid long hold times of j_state_lock while committing a transaction")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220611130426.2013258-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 97ef77c52b789ec1411d360ed99dca1efe4b2c81 ]
The original direct splicing mechanism from Jens required the input to
be a regular file because it was avoiding the special socket case. It
also recognized blkdevs as being close enough to a regular file. But it
forgot about chardevs, which behave the same way and work fine here.
This is an okayish heuristic, but it doesn't totally work. For example,
a few chardevs should be spliceable here. And a few regular files
shouldn't. This patch fixes this by instead checking whether FMODE_LSEEK
is set, which represents decently enough what we need rewinding for when
splicing to internal pipes.
Fixes: b92ce5589374 ("[PATCH] splice: add direct fd <-> fd splicing support")
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 448b5a1548d87c246c3d0c3df8480d3c6eb6c11a ]
Currently, vmap()s are avoided if physical addresses are
consecutive for decompressed buffers.
I observed that is very common for 4KiB pclusters since the
numbers of decompressed pages are almost 2 or 3.
However, such detection doesn't work for Highmem pages on
32-bit machines, let's fix it now.
Reported-by: Liu Jinbao <liujinbao1@xiaomi.com>
Fixes: 7fc45dbc938a ("staging: erofs: introduce generic decompression backend")
Link: https://lore.kernel.org/r/20220708101001.21242-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit fa78f336937240d1bc598db817d638086060e7e9 ]
Add checks verifying number of inodes stored in the superblock matches
the number computed from number of inodes per group. Also verify we have
at least one block worth of inodes per group. This prevents crashes on
corrupted filesystems.
Reported-by: syzbot+d273f7d7f58afd93be48@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit a16ceb13961068f7209e34d7984f8e42d2c06159 upstream.
If a process is killed or otherwise exits while having active network
connections and many threads waiting on epoll_wait, the threads will all
be woken immediately, but not removed from ep->wq. Then when network
traffic scans ep->wq in wake_up, every wakeup attempt will fail, and will
not remove the entries from the list.
This means that the cost of the wakeup attempt is far higher than usual,
does not decrease, and this also competes with the dying threads trying to
actually make progress and remove themselves from the wq.
Handle this by removing visited epoll wq entries unconditionally, rather
than only when the wakeup succeeds - the structure of ep_poll means that
the only potential loss is the timed_out->eavail heuristic, which now can
race and result in a redundant ep_send_events attempt. (But only when
incoming data and a timeout actually race, not on every timeout)
Shakeel added:
: We are seeing this issue in production with real workloads and it has
: caused hard lockups. Particularly network heavy workloads with a lot
: of threads in epoll_wait() can easily trigger this issue if they get
: killed (oom-killed in our case).
Link: https://lkml.kernel.org/r/xm26fsjotqda.fsf@google.com
Signed-off-by: Ben Segall <bsegall@google.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Roman Penyaev <rpenyaev@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Khazhismel Kumykov <khazhy@google.com>
Cc: Heiher <r@hev.cc>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dd524b7f317de8d31d638cbfdc7be4cf9b770e42 upstream.
Some code paths cannot guarantee the inode have any dentry alias. So
WARN_ON() all !dentry may flood the kernel logs.
For example, when an overlayfs inode is watched by inotifywait (1), and
someone is trying to read the /proc/$(pidof inotifywait)/fdinfo/INOTIFY_FD,
at that time if the dentry has been reclaimed by kernel (such as
echo 2 > /proc/sys/vm/drop_caches), there will be a WARN_ON(). The
printed call stack would be like:
? show_mark_fhandle+0xf0/0xf0
show_mark_fhandle+0x4a/0xf0
? show_mark_fhandle+0xf0/0xf0
? seq_vprintf+0x30/0x50
? seq_printf+0x53/0x70
? show_mark_fhandle+0xf0/0xf0
inotify_fdinfo+0x70/0x90
show_fdinfo.isra.4+0x53/0x70
seq_show+0x130/0x170
seq_read+0x153/0x440
vfs_read+0x94/0x150
ksys_read+0x5f/0xe0
do_syscall_64+0x59/0x1e0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
So let's drop WARN_ON() to avoid kernel log flooding.
Reported-by: Hongbo Yin <yinhongbo@bytedance.com>
Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Signed-off-by: Tianci Zhang <zhangtianci.1997@bytedance.com>
Fixes: 8ed5eec9d6c4 ("ovl: encode pure upper file handles")
Cc: <stable@vger.kernel.org> # v4.16
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ac6800e279a22b28f4fc21439843025a0d5bf03e upstream.
All creation paths except for O_TMPFILE handle umask in the vfs directly
if the filesystem doesn't support or enable POSIX ACLs. If the filesystem
does then umask handling is deferred until posix_acl_create().
Because, O_TMPFILE misses umask handling in the vfs it will not honor
umask settings. Fix this by adding the missing umask handling.
Link: https://lore.kernel.org/r/1657779088-2242-2-git-send-email-xuyang2018.jy@fujitsu.com
Fixes: 60545d0d4610 ("[O_TMPFILE] it's still short a few helpers, but infrastructure should be OK now...")
Cc: <stable@vger.kernel.org> # 4.19+
Reported-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e2ebff9c57fe4eb104ce4768f6ebcccf76bef849 upstream.
If something manages to set the maximum file size to MAX_OFFSET+1, this
can cause the xfs and ext4 filesystems at least to become corrupt.
Ordinarily, the kernel protects against userspace trying this by
checking the value early in the truncate() and ftruncate() system calls
calls - but there are at least two places that this check is bypassed:
(1) Cachefiles will round up the EOF of the backing file to DIO block
size so as to allow DIO on the final block - but this might push
the offset negative. It then calls notify_change(), but this
inadvertently bypasses the checking. This can be triggered if
someone puts an 8EiB-1 file on a server for someone else to try and
access by, say, nfs.
(2) ksmbd doesn't check the value it is given in set_end_of_file_info()
and then calls vfs_truncate() directly - which also bypasses the
check.
In both cases, it is potentially possible for a network filesystem to
cause a disk filesystem to be corrupted: cachefiles in the client's
cache filesystem; ksmbd in the server's filesystem.
nfsd is okay as it checks the value, but we can then remove this check
too.
Fix this by adding a check to inode_newsize_ok(), as called from
setattr_prepare(), thereby catching the issue as filesystems set up to
perform the truncate with minimal opportunity for bypassing the new
check.
Fixes: 1f08c925e7a3 ("cachefiles: Implement backing file wrangling")
Fixes: f44158485826 ("cifsd: add file operations")
Signed-off-by: David Howells <dhowells@redhat.com>
Reported-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Cc: stable@kernel.org
Acked-by: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Steve French <sfrench@samba.org>
cc: Hyunchul Lee <hyc.lee@gmail.com>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ba58995909b5098ca4003af65b0ccd5a8d13dd25 ]
This patch unsets ls_remove_len and ls_remove_name if a message
allocation of a remove messages fails. In this case we never send a
remove message out but set the per ls ls_remove_len ls_remove_name
variable for a pending remove. Unset those variable should indicate
possible waiters in wait_pending_remove() that no pending remove is
going on at this moment.
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 5924e6ec1585445f251ea92713eb15beb732622a upstream.
The permission flags of newly created symlinks are wrongly dropped on
nilfs2 with the current umask value even though symlinks should have 777
(rwxrwxrwx) permissions:
$ umask
0022
$ touch file && ln -s file symlink; ls -l file symlink
-rw-r--r--. 1 root root 0 Jun 23 16:29 file
lrwxr-xr-x. 1 root root 4 Jun 23 16:29 symlink -> file
This fixes the bug by inserting a missing check that excludes
symlinks.
Link: https://lkml.kernel.org/r/1655974441-5612-1-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: Tommy Pettersson <ptp@lysator.liu.se>
Reported-by: Ciprian Craciun <ciprian.craciun@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e445976537ad139162980bee015b7364e5b64fff upstream.
This ASSERT in xfs_rename is a) incorrect, because
(RENAME_WHITEOUT|RENAME_NOREPLACE) is a valid combination, and
b) unnecessary, because actual invalid flag combinations are already
handled at the vfs level in do_renameat2() before we get called.
So, remove it.
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Fixes: 7dcf5c3e4527 ("xfs: add RENAME_WHITEOUT support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit cb78d1b5efffe4cf97e16766329dd7358aed3deb ]
The recent patch to make afs_getattr consult the server didn't account
for the pseudo-inodes employed by the dynamic root-type afs superblock
not having a volume or a server to access, and thus an oops occurs if
such a directory is stat'd.
Fix this by checking to see if the vnode->volume pointer actually points
anywhere before following it in afs_getattr().
This can be tested by stat'ing a directory in /afs. It may be
sufficient just to do "ls /afs" and the oops looks something like:
BUG: kernel NULL pointer dereference, address: 0000000000000020
...
RIP: 0010:afs_getattr+0x8b/0x14b
...
Call Trace:
<TASK>
vfs_statx+0x79/0xf5
vfs_fstatat+0x49/0x62
Fixes: 2aeb8c86d499 ("afs: Fix afs_getattr() to refetch file status if callback break occurred")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/165408450783.1031787.7941404776393751186.stgit@warthog.procyon.org.uk/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit b55c3cd102a6f48b90e61c44f7f3dda8c290c694 upstream.
We capture a NULL pointer issue when resizing a corrupt ext4 image which
is freshly clear resize_inode feature (not run e2fsck). It could be
simply reproduced by following steps. The problem is because of the
resize_inode feature was cleared, and it will convert the filesystem to
meta_bg mode in ext4_resize_fs(), but the es->s_reserved_gdt_blocks was
not reduced to zero, so could we mistakenly call reserve_backup_gdb()
and passing an uninitialized resize_inode to it when adding new group
descriptors.
mkfs.ext4 /dev/sda 3G
tune2fs -O ^resize_inode /dev/sda #forget to run requested e2fsck
mount /dev/sda /mnt
resize2fs /dev/sda 8G
========
BUG: kernel NULL pointer dereference, address: 0000000000000028
CPU: 19 PID: 3243 Comm: resize2fs Not tainted 5.18.0-rc7-00001-gfde086c5ebfd #748
...
RIP: 0010:ext4_flex_group_add+0xe08/0x2570
...
Call Trace:
<TASK>
ext4_resize_fs+0xbec/0x1660
__ext4_ioctl+0x1749/0x24e0
ext4_ioctl+0x12/0x20
__x64_sys_ioctl+0xa6/0x110
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f2dd739617b
========
The fix is simple, add a check in ext4_resize_begin() to make sure that
the es->s_reserved_gdt_blocks is zero when the resize_inode feature is
disabled.
Cc: stable@kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220601092717.763694-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bc75a6eb856cb1507fa907bf6c1eda91b3fef52f upstream.
Since dx_make_map() may return -EFSCORRUPTED now, so change "count" to
be a signed integer so we can correctly check for an error code returned
by dx_make_map().
Fixes: 46c116b920eb ("ext4: verify dir block before splitting it")
Cc: stable@kernel.org
Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com>
Link: https://lore.kernel.org/r/20220530100047.537598-1-dingxiang@cmss.chinamobile.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a08f789d2ab5242c07e716baf9a835725046be89 upstream.
Hulk Robot reported a BUG_ON:
==================================================================
kernel BUG at fs/ext4/mballoc.c:3211!
[...]
RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f
[...]
Call Trace:
ext4_mb_new_blocks+0x9df/0x5d30
ext4_ext_map_blocks+0x1803/0x4d80
ext4_map_blocks+0x3a4/0x1a10
ext4_writepages+0x126d/0x2c30
do_writepages+0x7f/0x1b0
__filemap_fdatawrite_range+0x285/0x3b0
file_write_and_wait_range+0xb1/0x140
ext4_sync_file+0x1aa/0xca0
vfs_fsync_range+0xfb/0x260
do_fsync+0x48/0xa0
[...]
==================================================================
Above issue may happen as follows:
-------------------------------------
do_fsync
vfs_fsync_range
ext4_sync_file
file_write_and_wait_range
__filemap_fdatawrite_range
do_writepages
ext4_writepages
mpage_map_and_submit_extent
mpage_map_one_extent
ext4_map_blocks
ext4_mb_new_blocks
ext4_mb_normalize_request
>>> start + size <= ac->ac_o_ex.fe_logical
ext4_mb_regular_allocator
ext4_mb_simple_scan_group
ext4_mb_use_best_found
ext4_mb_new_preallocation
ext4_mb_new_inode_pa
ext4_mb_use_inode_pa
>>> set ac->ac_b_ex.fe_len <= 0
ext4_mb_mark_diskspace_used
>>> BUG_ON(ac->ac_b_ex.fe_len <= 0);
we can easily reproduce this problem with the following commands:
`fallocate -l100M disk`
`mkfs.ext4 -b 1024 -g 256 disk`
`mount disk /mnt`
`fsstress -d /mnt -l 0 -n 1000 -p 1`
The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP.
Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur
when the size is truncated. So start should be the start position of
the group where ac_o_ex.fe_logical is located after alignment.
In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP
is very large, the value calculated by start_off is more accurate.
Cc: stable@kernel.org
Fixes: cd648b8a8fd5 ("ext4: trim allocation requests to group size")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220528110017.354175-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit fe44fb23d6ccde4c914c44ef74ab8d9d9ba02bea ]
If the server tells us that a pNFS layout is not available for a
specific file, then we should not keep pounding it with further
layoutget requests.
Fixes: 183d9e7b112a ("pnfs: rework LAYOUTGET retry handling")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 507e4e2b430b6a27b66f4745564ecaee7967737f upstream.
These are all handled by the random driver, so instead of listing
each ioctl, we can use the generic compat_ptr_ioctl() helper.
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b577d0cd2104fdfcf0ded3707540a12be8ddd8b0 upstream.
In commit 45089142b149 Aneesh had missed one (admittedly, very unlikely
to hit) case in v9fs_stat2inode_dotl(). However, the same considerations
apply there as well - we have no business whatsoever to change ->i_rdev
or the file type.
Cc: Tadeusz Struk <tadeusz.struk@linaro.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8ea21823aa584b55ba4b861307093b78054b0c1b upstream.
During reconnects, we check the return value from
cifs_negotiate_protocol, and have handlers for both success
and failures. But if that passes, and cifs_setup_session
returns any errors other than -EACCES, we do not handle
that. This fix adds a handler for that, so that we don't
go ahead and try a tree_connect on a failed session.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>