74189 Commits

Author SHA1 Message Date
Jan Kara
5bc0b2fda4 ext4: fix deadlock due to mbcache entry corruption
[ Upstream commit a44e84a9b7764c72896f7241a0ec9ac7e7ef38dd ]

When manipulating xattr blocks, we can deadlock infinitely looping
inside ext4_xattr_block_set() where we constantly keep finding xattr
block for reuse in mbcache but we are unable to reuse it because its
reference count is too big. This happens because cache entry for the
xattr block is marked as reusable (e_reusable set) although its
reference count is too big. When this inconsistency happens, this
inconsistent state is kept indefinitely and so ext4_xattr_block_set()
keeps retrying indefinitely.

The inconsistent state is caused by non-atomic update of e_reusable bit.
e_reusable is part of a bitfield and e_reusable update can race with
update of e_referenced bit in the same bitfield resulting in loss of one
of the updates. Fix the problem by using atomic bitops instead.

This bug has been around for many years, but it became *much* easier
to hit after commit 65f8b80053a1 ("ext4: fix race when reusing xattr
blocks").

Cc: stable@vger.kernel.org
Fixes: 6048c64b2609 ("mbcache: add reusable flag to cache entries")
Fixes: 65f8b80053a1 ("ext4: fix race when reusing xattr blocks")
Reported-and-tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Reported-by: Thilo Fromm <t-lo@linux.microsoft.com>
Link: https://lore.kernel.org/r/c77bf00f-4618-7149-56f1-b8d1664b9d07@linux.microsoft.com/
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20221123193950.16758-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-12 11:59:07 +01:00
Jan Kara
a6e4094faf mbcache: automatically delete entries from cache on freeing
[ Upstream commit 307af6c879377c1c63e71cbdd978201f9c7ee8df ]

Use the fact that entries with elevated refcount are not removed from
the hash and just move removal of the entry from the hash to the entry
freeing time. When doing this we also change the generic code to hold
one reference to the cache entry, not two of them, which makes code
somewhat more obvious.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-10-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Stable-dep-of: a44e84a9b776 ("ext4: fix deadlock due to mbcache entry corruption")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-12 11:59:07 +01:00
Baokun Li
1872549129 ext4: correct inconsistent error msg in nojournal mode
[ Upstream commit 89481b5fa8c0640e62ba84c6020cee895f7ac643 ]

When we used the journal_async_commit mounting option in nojournal mode,
the kernel told me that "can't mount with journal_checksum", was very
confusing. I find that when we mount with journal_async_commit, both the
JOURNAL_ASYNC_COMMIT and EXPLICIT_JOURNAL_CHECKSUM flags are set. However,
in the error branch, CHECKSUM is checked before ASYNC_COMMIT. As a result,
the above inconsistency occurs, and the ASYNC_COMMIT branch becomes dead
code that cannot be executed. Therefore, we exchange the positions of the
two judgments to make the error msg more accurate.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221109074343.4184862-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-12 11:59:07 +01:00
Jason Yan
761f88f82e ext4: goto right label 'failed_mount3a'
[ Upstream commit 43bd6f1b49b61f43de4d4e33661b8dbe8c911f14 ]

Before these two branches neither loaded the journal nor created the
xattr cache. So the right label to goto is 'failed_mount3a'. Although
this did not cause any issues because the error handler validated if the
pointer is null. However this still made me confused when reading
the code. So it's still worth to modify to goto the right label.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220916141527.1012715-2-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Stable-dep-of: 89481b5fa8c0 ("ext4: correct inconsistent error msg in nojournal mode")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-12 11:59:07 +01:00
void0red
b8e7ed42bc btrfs: fix extent map use-after-free when handling missing device in read_one_chunk
[ Upstream commit 1742e1c90c3da344f3bb9b1f1309b3f47482756a ]

Store the error code before freeing the extent_map. Though it's
reference counted structure, in that function it's the first and last
allocation so this would lead to a potential use-after-free.

The error can happen eg. when chunk is stored on a missing device and
the degraded mount option is missing.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216721
Reported-by: eriri <1527030098@qq.com>
Fixes: adfb69af7d8c ("btrfs: add_missing_dev() should return the actual error")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: void0red <void0red@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-12 11:59:05 +01:00
Nikolay Borisov
9c3beebd21 btrfs: move missing device handling in a dedicate function
[ Upstream commit ff37c89f94be14b0e22a532d1e6d57187bfd5bb8 ]

This simplifies the code flow in read_one_chunk and makes error handling
when handling missing devices a bit simpler by reducing it to a single
check if something went wrong. No functional changes.

Reviewed-by: Su Yue <l@damenly.su>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 1742e1c90c3d ("btrfs: fix extent map use-after-free when handling missing device in read_one_chunk")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-12 11:59:05 +01:00
Sasha Levin
7528b21ceb btrfs: replace strncpy() with strscpy()
[ Upstream commit 63d5429f68a3d4c4aa27e65a05196c17f86c41d6 ]

Using strncpy() on NUL-terminated strings are deprecated.  To avoid
possible forming of non-terminated string strscpy() should be used.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Artem Chernyshev <artem.chernyshev@red-soft.ru>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-12 11:59:05 +01:00
Eric Biggers
18f28f1330 ext4: fix off-by-one errors in fast-commit block filling
From: Eric Biggers <ebiggers@google.com>

commit 48a6a66db82b8043d298a630f22c62d43550cae5 upstream.

Due to several different off-by-one errors, or perhaps due to a late
change in design that wasn't fully reflected in the code that was
actually merged, there are several very strange constraints on how
fast-commit blocks are filled with tlv entries:

- tlvs must start at least 10 bytes before the end of the block, even
  though the minimum tlv length is 8.  Otherwise, the replay code will
  ignore them.  (BUG: ext4_fc_reserve_space() could violate this
  requirement if called with a len of blocksize - 9 or blocksize - 8.
  Fortunately, this doesn't seem to happen currently.)

- tlvs must end at least 1 byte before the end of the block.  Otherwise
  the replay code will consider them to be invalid.  This quirk
  contributed to a bug (fixed by an earlier commit) where uninitialized
  memory was being leaked to disk in the last byte of blocks.

Also, strangely these constraints don't apply to the replay code in
e2fsprogs, which will accept any tlvs in the blocks (with no bounds
checks at all, but that is a separate issue...).

Given that this all seems to be a bug, let's fix it by just filling
blocks with tlv entries in the natural way.

Note that old kernels will be unable to replay fast-commit journals
created by kernels that have this commit.

Fixes: aa75f4d3daae ("ext4: main fast-commit commit path")
Cc: <stable@vger.kernel.org> # v5.10+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221106224841.279231-7-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:05 +01:00
Eric Biggers
b205332b6b ext4: fix unaligned memory access in ext4_fc_reserve_space()
From: Eric Biggers <ebiggers@google.com>

commit 8415ce07ecf0cc25efdd5db264a7133716e503cf upstream.

As is done elsewhere in the file, build the struct ext4_fc_tl on the
stack and memcpy() it into the buffer, rather than directly writing it
to a potentially-unaligned location in the buffer.

Fixes: aa75f4d3daae ("ext4: main fast-commit commit path")
Cc: <stable@vger.kernel.org> # v5.10+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221106224841.279231-6-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:05 +01:00
Eric Biggers
9c197dcbac ext4: add missing validation of fast-commit record lengths
From: Eric Biggers <ebiggers@google.com>

commit 64b4a25c3de81a69724e888ec2db3533b43816e2 upstream.

Validate the inode and filename lengths in fast-commit journal records
so that a malicious fast-commit journal cannot cause a crash by having
invalid values for these.  Also validate EXT4_FC_TAG_DEL_RANGE.

Fixes: aa75f4d3daae ("ext4: main fast-commit commit path")
Cc: <stable@vger.kernel.org> # v5.10+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221106224841.279231-5-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:05 +01:00
Eric Biggers
6220ec4055 ext4: don't set up encryption key during jbd2 transaction
From: Eric Biggers <ebiggers@google.com>

commit 4c0d5778385cb3618ff26a561ce41de2b7d9de70 upstream.

Commit a80f7fcf1867 ("ext4: fixup ext4_fc_track_* functions' signature")
extended the scope of the transaction in ext4_unlink() too far, making
it include the call to ext4_find_entry().  However, ext4_find_entry()
can deadlock when called from within a transaction because it may need
to set up the directory's encryption key.

Fix this by restoring the transaction to its original scope.

Reported-by: syzbot+1a748d0007eeac3ab079@syzkaller.appspotmail.com
Fixes: a80f7fcf1867 ("ext4: fixup ext4_fc_track_* functions' signature")
Cc: <stable@vger.kernel.org> # v5.10+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221106224841.279231-3-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:04 +01:00
Eric Biggers
6482d42baf ext4: disable fast-commit of encrypted dir operations
From: Eric Biggers <ebiggers@google.com>

commit 0fbcb5251fc81b58969b272c4fb7374a7b922e3e upstream.

fast-commit of create, link, and unlink operations in encrypted
directories is completely broken because the unencrypted filenames are
being written to the fast-commit journal instead of the encrypted
filenames.  These operations can't be replayed, as encryption keys
aren't present at journal replay time.  It is also an information leak.

Until if/when we can get this working properly, make encrypted directory
operations ineligible for fast-commit.

Note that fast-commit operations on encrypted regular files continue to
be allowed, as they seem to work.

Fixes: aa75f4d3daae ("ext4: main fast-commit commit path")
Cc: <stable@vger.kernel.org> # v5.10+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221106224841.279231-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:04 +01:00
Eric Biggers
6969367c15 ext4: fix potential out of bound read in ext4_fc_replay_scan()
From: Ye Bin <yebin10@huawei.com>

commit 1b45cc5c7b920fd8bf72e5a888ec7abeadf41e09 upstream.

For scan loop must ensure that at least EXT4_FC_TAG_BASE_LEN space. If remain
space less than EXT4_FC_TAG_BASE_LEN which will lead to out of bound read
when mounting corrupt file system image.
ADD_RANGE/HEAD/TAIL is needed to add extra check when do journal scan, as this
three tags will read data during scan, tag length couldn't less than data length
which will read.

Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20220924075233.2315259-4-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:04 +01:00
Eric Biggers
818175ae3b ext4: factor out ext4_fc_get_tl()
From: Ye Bin <yebin10@huawei.com>

commit dcc5827484d6e53ccda12334f8bbfafcc593ceda upstream.

Factor out ext4_fc_get_tl() to fill 'tl' with host byte order.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20220924075233.2315259-3-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:04 +01:00
Eric Biggers
ffd84d0bc5 ext4: introduce EXT4_FC_TAG_BASE_LEN helper
From: Ye Bin <yebin10@huawei.com>

commit fdc2a3c75dd8345c5b48718af90bad1a7811bedb upstream.

Introduce EXT4_FC_TAG_BASE_LEN helper for calculate length of
struct ext4_fc_tl.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20220924075233.2315259-2-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:04 +01:00
Eric Biggers
37914e029b ext4: use ext4_debug() instead of jbd_debug()
From: Jan Kara <jack@suse.cz>

commit 4978c659e7b5c1926cdb4b556e4ca1fd2de8ad42 upstream.

We use jbd_debug() in some places in ext4. It seems a bit strange to use
jbd2 debugging output function for ext4 code. Also these days
ext4_debug() uses dynamic printk so each debug message can be enabled /
disabled on its own so the time when it made some sense to have these
combined (to allow easier common selecting of messages to report) has
passed. Just convert all jbd_debug() uses in ext4 to ext4_debug().

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220608112355.4397-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:04 +01:00
Eric Biggers
b0ed9a032e ext4: remove unused enum EXT4_FC_COMMIT_FAILED
From: Ritesh Harjani <riteshh@linux.ibm.com>

commit c864ccd182d6ff2730a0f5b636c6b7c48f6f4f7f upstream.

Below commit removed all references of EXT4_FC_COMMIT_FAILED.
commit 0915e464cb274 ("ext4: simplify updating of fast commit stats")

Just remove it since it is not used anymore.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/c941357e476be07a1138c7319ca5faab7fb80fc6.1647057583.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:03 +01:00
Ye Bin
2771c7a0ee ext4: allocate extended attribute value in vmalloc area
commit cc12a6f25e07ed05d5825a1664b67a970842b2ca upstream.

Now, extended attribute value maximum length is 64K. The memory
requested here does not need continuous physical addresses, so it is
appropriate to use kvmalloc to request memory. At the same time, it
can also cope with the situation that the extended attribute will
become longer in the future.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221208023233.1231330-3-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:03 +01:00
Jan Kara
e995ff918e ext4: avoid unaccounted block allocation when expanding inode
commit 8994d11395f8165b3deca1971946f549f0822630 upstream.

When expanding inode space in ext4_expand_extra_isize_ea() we may need
to allocate external xattr block. If quota is not initialized for the
inode, the block allocation will not be accounted into quota usage. Make
sure the quota is initialized before we try to expand inode space.

Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Link: https://lore.kernel.org/all/Y5BT+k6xWqthZc1P@xpf.sh.intel.com
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20221207115937.26601-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:03 +01:00
Jan Kara
877247222a ext4: initialize quota before expanding inode in setproject ioctl
commit 1485f726c6dec1a1f85438f2962feaa3d585526f upstream.

Make sure we initialize quotas before possibly expanding inode space
(and thus maybe needing to allocate external xattr block) in
ext4_ioctl_setproject(). This prevents not accounting the necessary
block allocation.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20221207115937.26601-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:03 +01:00
Ye Bin
322cf639b0 ext4: fix inode leak in ext4_xattr_inode_create() on an error path
commit e4db04f7d3dbbe16680e0ded27ea2a65b10f766a upstream.

There is issue as follows when do setxattr with inject fault:

[localhost]# fsck.ext4  -fn  /dev/sda
e2fsck 1.46.6-rc1 (12-Sep-2022)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Unattached zero-length inode 15.  Clear? no

Unattached inode 15
Connect to /lost+found? no

Pass 5: Checking group summary information

/dev/sda: ********** WARNING: Filesystem still has errors **********

/dev/sda: 15/655360 files (0.0% non-contiguous), 66755/2621440 blocks

This occurs in 'ext4_xattr_inode_create()'. If 'ext4_mark_inode_dirty()'
fails, dropping i_nlink of the inode is needed. Or will lead to inode leak.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221208023233.1231330-5-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:02 +01:00
Ye Bin
6380a93b57 ext4: fix kernel BUG in 'ext4_write_inline_data_end()'
commit 5c099c4fdc438014d5893629e70a8ba934433ee8 upstream.

Syzbot report follow issue:
------------[ cut here ]------------
kernel BUG at fs/ext4/inline.c:227!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227
RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000
RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006
RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199
R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b
R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82
FS:  000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0
Call Trace:
 <TASK>
 ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768
 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313
 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063
 generic_perform_write+0x316/0x570 mm/filemap.c:3764
 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285
 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700
 call_write_iter include/linux/fs.h:2191 [inline]
 do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
 do_iter_write+0x182/0x700 fs/read_write.c:861
 vfs_iter_write+0x74/0xa0 fs/read_write.c:902
 iter_file_splice_write+0x745/0xc90 fs/splice.c:686
 do_splice_from fs/splice.c:764 [inline]
 direct_splice_actor+0x114/0x180 fs/splice.c:931
 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
 do_splice_direct+0x1ab/0x280 fs/splice.c:974
 do_sendfile+0xb19/0x1270 fs/read_write.c:1255
 __do_sys_sendfile64 fs/read_write.c:1323 [inline]
 __se_sys_sendfile64 fs/read_write.c:1309 [inline]
 __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
---[ end trace 0000000000000000 ]---

Above issue may happens as follows:
ext4_da_write_begin
  ext4_da_write_inline_data_begin
    ext4_da_convert_inline_data_to_extent
      ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
ext4_da_write_end

ext4_run_li_request
  ext4_mb_prefetch
    ext4_read_block_bitmap_nowait
      ext4_validate_block_bitmap
        ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT)
	 percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free);
	  -> sbi->s_freeclusters_counter become zero
ext4_da_write_begin
  if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true
    *fsdata = (void *)FALL_BACK_TO_NONDELALLOC;
    ext4_write_begin
ext4_da_write_end
  if (write_mode == FALL_BACK_TO_NONDELALLOC)
    ext4_write_end
      if (inline_data)
        ext4_write_inline_data_end
	  ext4_write_inline_data
	    BUG_ON(pos + len > EXT4_I(inode)->i_inline_size);
           -> As inode is already convert to extent, so 'pos + len' > inline_size
	   -> then trigger BUG.

To solve this issue, instead of checking ext4_has_inline_data() which
is only cleared after data has been written back, check the
EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end().

Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data")
Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com
Reported-by: Jun Nie <jun.nie@linaro.org>
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:02 +01:00
Jan Kara
dc3bbc9753 ext4: avoid BUG_ON when creating xattrs
commit b40ebaf63851b3a401b0dc9263843538f64f5ce6 upstream.

Commit fb0a387dcdcd ("ext4: limit block allocations for indirect-block
files to < 2^32") added code to try to allocate xattr block with 32-bit
block number for indirect block based files on the grounds that these
files cannot use larger block numbers. It also added BUG_ON when
allocated block could not fit into 32 bits. This is however bogus
reasoning because xattr block is stored in inode->i_file_acl and
inode->i_file_acl_hi and as such even indirect block based files can
happily use full 48 bits for xattr block number. The proper handling
seems to be there basically since 64-bit block number support was added.
So remove the bogus limitation and BUG_ON.

Cc: Eric Sandeen <sandeen@redhat.com>
Fixes: fb0a387dcdcd ("ext4: limit block allocations for indirect-block files to < 2^32")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221121130929.32031-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:02 +01:00
Luís Henriques
844c405552 ext4: fix error code return to user-space in ext4_get_branch()
commit 26d75a16af285a70863ba6a81f85d81e7e65da50 upstream.

If a block is out of range in ext4_get_branch(), -ENOMEM will be returned
to user-space.  Obviously, this error code isn't really useful.  This
patch fixes it by making sure the right error code (-EFSCORRUPTED) is
propagated to user-space.  EUCLEAN is more informative than ENOMEM.

Signed-off-by: Luís Henriques <lhenriques@suse.de>
Link: https://lore.kernel.org/r/20221109181445.17843-1-lhenriques@suse.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:02 +01:00
Baokun Li
b870b28e29 ext4: fix corruption when online resizing a 1K bigalloc fs
commit 0aeaa2559d6d53358fca3e3fce73807367adca74 upstream.

When a backup superblock is updated in update_backups(), the primary
superblock's offset in the group (that is, sbi->s_sbh->b_blocknr) is used
as the backup superblock's offset in its group. However, when the block
size is 1K and bigalloc is enabled, the two offsets are not equal. This
causes the backup group descriptors to be overwritten by the superblock
in update_backups(). Moreover, if meta_bg is enabled, the file system will
be corrupted because this feature uses backup group descriptors.

To solve this issue, we use a more accurate ext4_group_first_block_no() as
the offset of the backup superblock in its group.

Fixes: d77147ff443b ("ext4: add support for online resizing with bigalloc")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20221117040341.1380702-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:02 +01:00
Eric Whitney
d440d6427a ext4: fix delayed allocation bug in ext4_clu_mapped for bigalloc + inline
commit 131294c35ed6f777bd4e79d42af13b5c41bf2775 upstream.

When converting files with inline data to extents, delayed allocations
made on a file system created with both the bigalloc and inline options
can result in invalid extent status cache content, incorrect reserved
cluster counts, kernel memory leaks, and potential kernel panics.

With bigalloc, the code that determines whether a block must be
delayed allocated searches the extent tree to see if that block maps
to a previously allocated cluster.  If not, the block is delayed
allocated, and otherwise, it isn't.  However, if the inline option is
also used, and if the file containing the block is marked as able to
store data inline, there isn't a valid extent tree associated with
the file.  The current code in ext4_clu_mapped() calls
ext4_find_extent() to search the non-existent tree for a previously
allocated cluster anyway, which typically finds nothing, as desired.
However, a side effect of the search can be to cache invalid content
from the non-existent tree (garbage) in the extent status tree,
including bogus entries in the pending reservation tree.

To fix this, avoid searching the extent tree when allocating blocks
for bigalloc + inline files that are being converted from inline to
extent mapped.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20221117152207.2424-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:02 +01:00
Ye Bin
def7a39091 ext4: init quota for 'old.inode' in 'ext4_rename'
commit fae381a3d79bb94aa2eb752170d47458d778b797 upstream.

Syzbot found the following issue:
ext4_parse_param: s_want_extra_isize=128
ext4_inode_info_init: s_want_extra_isize=32
ext4_rename: old.inode=ffff88823869a2c8 old.dir=ffff888238699828 new.inode=ffff88823869d7e8 new.dir=ffff888238699828
__ext4_mark_inode_dirty: inode=ffff888238699828 ea_isize=32 want_ea_size=128
__ext4_mark_inode_dirty: inode=ffff88823869a2c8 ea_isize=32 want_ea_size=128
ext4_xattr_block_set: inode=ffff88823869a2c8
------------[ cut here ]------------
WARNING: CPU: 13 PID: 2234 at fs/ext4/xattr.c:2070 ext4_xattr_block_set.cold+0x22/0x980
Modules linked in:
RIP: 0010:ext4_xattr_block_set.cold+0x22/0x980
RSP: 0018:ffff888227d3f3b0 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff88823007a000 RCX: 0000000000000000
RDX: 0000000000000a03 RSI: 0000000000000040 RDI: ffff888230078178
RBP: 0000000000000000 R08: 000000000000002c R09: ffffed1075c7df8e
R10: ffff8883ae3efc6b R11: ffffed1075c7df8d R12: 0000000000000000
R13: ffff88823869a2c8 R14: ffff8881012e0460 R15: dffffc0000000000
FS:  00007f350ac1f740(0000) GS:ffff8883ae200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f350a6ed6a0 CR3: 0000000237456000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? ext4_xattr_set_entry+0x3b7/0x2320
 ? ext4_xattr_block_set+0x0/0x2020
 ? ext4_xattr_set_entry+0x0/0x2320
 ? ext4_xattr_check_entries+0x77/0x310
 ? ext4_xattr_ibody_set+0x23b/0x340
 ext4_xattr_move_to_block+0x594/0x720
 ext4_expand_extra_isize_ea+0x59a/0x10f0
 __ext4_expand_extra_isize+0x278/0x3f0
 __ext4_mark_inode_dirty.cold+0x347/0x410
 ext4_rename+0xed3/0x174f
 vfs_rename+0x13a7/0x2510
 do_renameat2+0x55d/0x920
 __x64_sys_rename+0x7d/0xb0
 do_syscall_64+0x3b/0xa0
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

As 'ext4_rename' will modify 'old.inode' ctime and mark inode dirty,
which may trigger expand 'extra_isize' and allocate block. If inode
didn't init quota will lead to warning.  To solve above issue, init
'old.inode' firstly in 'ext4_rename'.

Reported-by: syzbot+98346927678ac3059c77@syzkaller.appspotmail.com
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221107015335.2524319-1-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:02 +01:00
Ye Bin
3c31d8d3ad ext4: fix uninititialized value in 'ext4_evict_inode'
commit 7ea71af94eaaaf6d9aed24bc94a05b977a741cb9 upstream.

Syzbot found the following issue:
=====================================================
BUG: KMSAN: uninit-value in ext4_evict_inode+0xdd/0x26b0 fs/ext4/inode.c:180
 ext4_evict_inode+0xdd/0x26b0 fs/ext4/inode.c:180
 evict+0x365/0x9a0 fs/inode.c:664
 iput_final fs/inode.c:1747 [inline]
 iput+0x985/0xdd0 fs/inode.c:1773
 __ext4_new_inode+0xe54/0x7ec0 fs/ext4/ialloc.c:1361
 ext4_mknod+0x376/0x840 fs/ext4/namei.c:2844
 vfs_mknod+0x79d/0x830 fs/namei.c:3914
 do_mknodat+0x47d/0xaa0
 __do_sys_mknodat fs/namei.c:3992 [inline]
 __se_sys_mknodat fs/namei.c:3989 [inline]
 __ia32_sys_mknodat+0xeb/0x150 fs/namei.c:3989
 do_syscall_32_irqs_on arch/x86/entry/common.c:112 [inline]
 __do_fast_syscall_32+0xa2/0x100 arch/x86/entry/common.c:178
 do_fast_syscall_32+0x33/0x70 arch/x86/entry/common.c:203
 do_SYSENTER_32+0x1b/0x20 arch/x86/entry/common.c:246
 entry_SYSENTER_compat_after_hwframe+0x70/0x82

Uninit was created at:
 __alloc_pages+0x9f1/0xe80 mm/page_alloc.c:5578
 alloc_pages+0xaae/0xd80 mm/mempolicy.c:2285
 alloc_slab_page mm/slub.c:1794 [inline]
 allocate_slab+0x1b5/0x1010 mm/slub.c:1939
 new_slab mm/slub.c:1992 [inline]
 ___slab_alloc+0x10c3/0x2d60 mm/slub.c:3180
 __slab_alloc mm/slub.c:3279 [inline]
 slab_alloc_node mm/slub.c:3364 [inline]
 slab_alloc mm/slub.c:3406 [inline]
 __kmem_cache_alloc_lru mm/slub.c:3413 [inline]
 kmem_cache_alloc_lru+0x6f3/0xb30 mm/slub.c:3429
 alloc_inode_sb include/linux/fs.h:3117 [inline]
 ext4_alloc_inode+0x5f/0x860 fs/ext4/super.c:1321
 alloc_inode+0x83/0x440 fs/inode.c:259
 new_inode_pseudo fs/inode.c:1018 [inline]
 new_inode+0x3b/0x430 fs/inode.c:1046
 __ext4_new_inode+0x2a7/0x7ec0 fs/ext4/ialloc.c:959
 ext4_mkdir+0x4d5/0x1560 fs/ext4/namei.c:2992
 vfs_mkdir+0x62a/0x870 fs/namei.c:4035
 do_mkdirat+0x466/0x7b0 fs/namei.c:4060
 __do_sys_mkdirat fs/namei.c:4075 [inline]
 __se_sys_mkdirat fs/namei.c:4073 [inline]
 __ia32_sys_mkdirat+0xc4/0x120 fs/namei.c:4073
 do_syscall_32_irqs_on arch/x86/entry/common.c:112 [inline]
 __do_fast_syscall_32+0xa2/0x100 arch/x86/entry/common.c:178
 do_fast_syscall_32+0x33/0x70 arch/x86/entry/common.c:203
 do_SYSENTER_32+0x1b/0x20 arch/x86/entry/common.c:246
 entry_SYSENTER_compat_after_hwframe+0x70/0x82

CPU: 1 PID: 4625 Comm: syz-executor.2 Not tainted 6.1.0-rc4-syzkaller-62821-gcb231e2f67ec #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
=====================================================

Now, 'ext4_alloc_inode()' didn't init 'ei->i_flags'. If new inode failed
before set 'ei->i_flags' in '__ext4_new_inode()', then do 'iput()'. As after
6bc0d63dad7f commit will access 'ei->i_flags' in 'ext4_evict_inode()' which
will lead to access uninit-value.
To solve above issue just init 'ei->i_flags' in 'ext4_alloc_inode()'.

Reported-by: syzbot+57b25da729eb0b88177d@syzkaller.appspotmail.com
Signed-off-by: Ye Bin <yebin10@huawei.com>
Fixes: 6bc0d63dad7f ("ext4: remove EA inode entry from mbcache on inode eviction")
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221117073603.2598882-1-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:01 +01:00
Eric Biggers
871800770d ext4: fix leaking uninitialized memory in fast-commit journal
commit 594bc43b410316d70bb42aeff168837888d96810 upstream.

When space at the end of fast-commit journal blocks is unused, make sure
to zero it out so that uninitialized memory is not leaked to disk.

Fixes: aa75f4d3daae ("ext4: main fast-commit commit path")
Cc: <stable@vger.kernel.org> # v5.10+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221106224841.279231-4-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:01 +01:00
Baokun Li
d480a49c15 ext4: fix bug_on in __es_tree_search caused by bad boot loader inode
commit 991ed014de0840c5dc405b679168924afb2952ac upstream.

We got a issue as fllows:
==================================================================
 kernel BUG at fs/ext4/extents_status.c:203!
 invalid opcode: 0000 [#1] PREEMPT SMP
 CPU: 1 PID: 945 Comm: cat Not tainted 6.0.0-next-20221007-dirty #349
 RIP: 0010:ext4_es_end.isra.0+0x34/0x42
 RSP: 0018:ffffc9000143b768 EFLAGS: 00010203
 RAX: 0000000000000000 RBX: ffff8881769cd0b8 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffffffff8fc27cf7 RDI: 00000000ffffffff
 RBP: ffff8881769cd0bc R08: 0000000000000000 R09: ffffc9000143b5f8
 R10: 0000000000000001 R11: 0000000000000001 R12: ffff8881769cd0a0
 R13: ffff8881768e5668 R14: 00000000768e52f0 R15: 0000000000000000
 FS: 00007f359f7f05c0(0000)GS:ffff88842fd00000(0000)knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f359f5a2000 CR3: 000000017130c000 CR4: 00000000000006e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  __es_tree_search.isra.0+0x6d/0xf5
  ext4_es_cache_extent+0xfa/0x230
  ext4_cache_extents+0xd2/0x110
  ext4_find_extent+0x5d5/0x8c0
  ext4_ext_map_blocks+0x9c/0x1d30
  ext4_map_blocks+0x431/0xa50
  ext4_mpage_readpages+0x48e/0xe40
  ext4_readahead+0x47/0x50
  read_pages+0x82/0x530
  page_cache_ra_unbounded+0x199/0x2a0
  do_page_cache_ra+0x47/0x70
  page_cache_ra_order+0x242/0x400
  ondemand_readahead+0x1e8/0x4b0
  page_cache_sync_ra+0xf4/0x110
  filemap_get_pages+0x131/0xb20
  filemap_read+0xda/0x4b0
  generic_file_read_iter+0x13a/0x250
  ext4_file_read_iter+0x59/0x1d0
  vfs_read+0x28f/0x460
  ksys_read+0x73/0x160
  __x64_sys_read+0x1e/0x30
  do_syscall_64+0x35/0x80
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
  </TASK>
==================================================================

In the above issue, ioctl invokes the swap_inode_boot_loader function to
swap inode<5> and inode<12>. However, inode<5> contain incorrect imode and
disordered extents, and i_nlink is set to 1. The extents check for inode in
the ext4_iget function can be bypassed bacause 5 is EXT4_BOOT_LOADER_INO.
While links_count is set to 1, the extents are not initialized in
swap_inode_boot_loader. After the ioctl command is executed successfully,
the extents are swapped to inode<12>, in this case, run the `cat` command
to view inode<12>. And Bug_ON is triggered due to the incorrect extents.

When the boot loader inode is not initialized, its imode can be one of the
following:
1) the imode is a bad type, which is marked as bad_inode in ext4_iget and
   set to S_IFREG.
2) the imode is good type but not S_IFREG.
3) the imode is S_IFREG.

The BUG_ON may be triggered by bypassing the check in cases 1 and 2.
Therefore, when the boot loader inode is bad_inode or its imode is not
S_IFREG, initialize the inode to avoid triggering the BUG.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221026042310.3839669-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:01 +01:00
Zhang Yi
91009e361e ext4: check and assert if marking an no_delete evicting inode dirty
commit 318cdc822c63b6e2befcfdc2088378ae6fa18def upstream.

In ext4_evict_inode(), if we evicting an inode in the 'no_delete' path,
it cannot be raced by another mark_inode_dirty(). If it happens,
someone else may accidentally dirty it without holding inode refcount
and probably cause use-after-free issues in the writeback procedure.
It's indiscoverable and hard to debug, so add an WARN_ON_ONCE() to
check and detect this issue in advance.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220629112647.4141034-2-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:01 +01:00
Ye Bin
820eacbc4e ext4: fix reserved cluster accounting in __es_remove_extent()
commit 1da18e38cb97e9521e93d63034521a9649524f64 upstream.

When bigalloc is enabled, reserved cluster accounting for delayed
allocation is handled in extent_status.c.  With a corrupted file
system, it's possible for this accounting to be incorrect,
dsicovered by Syzbot:

EXT4-fs error (device loop0): ext4_validate_block_bitmap:398: comm rep:
	bg 0: block 5: invalid block bitmap
EXT4-fs (loop0): Delayed block allocation failed for inode 18 at logical
	offset 0 with max blocks 32 with error 28
EXT4-fs (loop0): This should not happen!! Data will be lost

EXT4-fs (loop0): Total free blocks count 0
EXT4-fs (loop0): Free/Dirty block details
EXT4-fs (loop0): free_blocks=0
EXT4-fs (loop0): dirty_blocks=32
EXT4-fs (loop0): Block reservation details
EXT4-fs (loop0): i_reserved_data_blocks=2
EXT4-fs (loop0): Inode 18 (00000000845cd634):
	i_reserved_data_blocks (1) not cleared!

Above issue happens as follows:
Assume:
sbi->s_cluster_ratio = 16
Step1:
Insert delay block [0, 31] -> ei->i_reserved_data_blocks=2
Step2:
ext4_writepages
  mpage_map_and_submit_extent -> return failed
  mpage_release_unused_pages -> to release [0, 30]
    ext4_es_remove_extent -> remove lblk=0 end=30
      __es_remove_extent -> len1=0 len2=31-30=1
 __es_remove_extent:
 ...
 if (len2 > 0) {
  ...
	  if (len1 > 0) {
		  ...
	  } else {
		es->es_lblk = end + 1;
		es->es_len = len2;
		...
	  }
  	if (count_reserved)
		count_rsvd(inode, lblk, ...);
	goto out; -> will return but didn't calculate 'reserved'
 ...
Step3:
ext4_destroy_inode -> trigger "i_reserved_data_blocks (1) not cleared!"

To solve above issue if 'len2>0' call 'get_rsvd()' before goto out.

Reported-by: syzbot+05a0f0ccab4a25626e38@syzkaller.appspotmail.com
Fixes: 8fcc3a580651 ("ext4: rework reserved cluster accounting when invalidating pages")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20221208033426.1832460-2-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:01 +01:00
Baokun Li
0dcbf4dc3d ext4: fix bug_on in __es_tree_search caused by bad quota inode
commit d323877484765aaacbb2769b06e355c2041ed115 upstream.

We got a issue as fllows:
==================================================================
 kernel BUG at fs/ext4/extents_status.c:202!
 invalid opcode: 0000 [#1] PREEMPT SMP
 CPU: 1 PID: 810 Comm: mount Not tainted 6.1.0-rc1-next-g9631525255e3 #352
 RIP: 0010:__es_tree_search.isra.0+0xb8/0xe0
 RSP: 0018:ffffc90001227900 EFLAGS: 00010202
 RAX: 0000000000000000 RBX: 0000000077512a0f RCX: 0000000000000000
 RDX: 0000000000000002 RSI: 0000000000002a10 RDI: ffff8881004cd0c8
 RBP: ffff888177512ac8 R08: 47ffffffffffffff R09: 0000000000000001
 R10: 0000000000000001 R11: 00000000000679af R12: 0000000000002a10
 R13: ffff888177512d88 R14: 0000000077512a10 R15: 0000000000000000
 FS: 00007f4bd76dbc40(0000)GS:ffff88842fd00000(0000)knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00005653bf993cf8 CR3: 000000017bfdf000 CR4: 00000000000006e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  ext4_es_cache_extent+0xe2/0x210
  ext4_cache_extents+0xd2/0x110
  ext4_find_extent+0x5d5/0x8c0
  ext4_ext_map_blocks+0x9c/0x1d30
  ext4_map_blocks+0x431/0xa50
  ext4_getblk+0x82/0x340
  ext4_bread+0x14/0x110
  ext4_quota_read+0xf0/0x180
  v2_read_header+0x24/0x90
  v2_check_quota_file+0x2f/0xa0
  dquot_load_quota_sb+0x26c/0x760
  dquot_load_quota_inode+0xa5/0x190
  ext4_enable_quotas+0x14c/0x300
  __ext4_fill_super+0x31cc/0x32c0
  ext4_fill_super+0x115/0x2d0
  get_tree_bdev+0x1d2/0x360
  ext4_get_tree+0x19/0x30
  vfs_get_tree+0x26/0xe0
  path_mount+0x81d/0xfc0
  do_mount+0x8d/0xc0
  __x64_sys_mount+0xc0/0x160
  do_syscall_64+0x35/0x80
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
  </TASK>
==================================================================

Above issue may happen as follows:
-------------------------------------
ext4_fill_super
 ext4_orphan_cleanup
  ext4_enable_quotas
   ext4_quota_enable
    ext4_iget --> get error inode <5>
     ext4_ext_check_inode --> Wrong imode makes it escape inspection
     make_bad_inode(inode) --> EXT4_BOOT_LOADER_INO set imode
    dquot_load_quota_inode
     vfs_setup_quota_inode --> check pass
     dquot_load_quota_sb
      v2_check_quota_file
       v2_read_header
        ext4_quota_read
         ext4_bread
          ext4_getblk
           ext4_map_blocks
            ext4_ext_map_blocks
             ext4_find_extent
              ext4_cache_extents
               ext4_es_cache_extent
                __es_tree_search.isra.0
                 ext4_es_end --> Wrong extents trigger BUG_ON

In the above issue, s_usr_quota_inum is set to 5, but inode<5> contains
incorrect imode and disordered extents. Because 5 is EXT4_BOOT_LOADER_INO,
the ext4_ext_check_inode check in the ext4_iget function can be bypassed,
finally, the extents that are not checked trigger the BUG_ON in the
__es_tree_search function. To solve this issue, check whether the inode is
bad_inode in vfs_setup_quota_inode().

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221026042310.3839669-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:01 +01:00
Baokun Li
06a20a68bb ext4: add helper to check quota inums
commit 07342ec259df2a35d6a34aebce010567a80a0e15 upstream.

Before quota is enabled, a check on the preset quota inums in
ext4_super_block is added to prevent wrong quota inodes from being loaded.
In addition, when the quota fails to be enabled, the quota type and quota
inum are printed to facilitate fault locating.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221026042310.3839669-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:01 +01:00
Baokun Li
f7e6b5548f ext4: add EXT4_IGET_BAD flag to prevent unexpected bad inode
commit 63b1e9bccb71fe7d7e3ddc9877dbdc85e5d2d023 upstream.

There are many places that will get unhappy (and crash) when ext4_iget()
returns a bad inode. However, if iget the boot loader inode, allows a bad
inode to be returned, because the inode may not be initialized. This
mechanism can be used to bypass some checks and cause panic. To solve this
problem, we add a special iget flag EXT4_IGET_BAD. Only with this flag
we'd be returning bad inode from ext4_iget(), otherwise we always return
the error code if the inode is bad inode.(suggested by Jan Kara)

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221026042310.3839669-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:01 +01:00
Gaosheng Cui
205ac16628 ext4: fix undefined behavior in bit shift for ext4_check_flag_values
commit 3bf678a0f9c017c9ba7c581541dbc8453452a7ae upstream.

Shifting signed 32-bit value by 31 bits is undefined, so changing
significant bit to unsigned. The UBSAN warning calltrace like below:

UBSAN: shift-out-of-bounds in fs/ext4/ext4.h:591:2
left shift of 1 by 31 places cannot be represented in type 'int'
Call Trace:
 <TASK>
 dump_stack_lvl+0x7d/0xa5
 dump_stack+0x15/0x1b
 ubsan_epilogue+0xe/0x4e
 __ubsan_handle_shift_out_of_bounds+0x1e7/0x20c
 ext4_init_fs+0x5a/0x277
 do_one_initcall+0x76/0x430
 kernel_init_freeable+0x3b3/0x422
 kernel_init+0x24/0x1e0
 ret_from_fork+0x1f/0x30
 </TASK>

Fixes: 9a4c80194713 ("ext4: ensure Inode flags consistency are checked at build time")
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Link: https://lore.kernel.org/r/20221031055833.3966222-1-cuigaosheng1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:00 +01:00
Baokun Li
cf0e0817b0 ext4: fix use-after-free in ext4_orphan_cleanup
commit a71248b1accb2b42e4980afef4fa4a27fa0e36f5 upstream.

I caught a issue as follows:
==================================================================
 BUG: KASAN: use-after-free in __list_add_valid+0x28/0x1a0
 Read of size 8 at addr ffff88814b13f378 by task mount/710

 CPU: 1 PID: 710 Comm: mount Not tainted 6.1.0-rc3-next #370
 Call Trace:
  <TASK>
  dump_stack_lvl+0x73/0x9f
  print_report+0x25d/0x759
  kasan_report+0xc0/0x120
  __asan_load8+0x99/0x140
  __list_add_valid+0x28/0x1a0
  ext4_orphan_cleanup+0x564/0x9d0 [ext4]
  __ext4_fill_super+0x48e2/0x5300 [ext4]
  ext4_fill_super+0x19f/0x3a0 [ext4]
  get_tree_bdev+0x27b/0x450
  ext4_get_tree+0x19/0x30 [ext4]
  vfs_get_tree+0x49/0x150
  path_mount+0xaae/0x1350
  do_mount+0xe2/0x110
  __x64_sys_mount+0xf0/0x190
  do_syscall_64+0x35/0x80
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
  </TASK>
 [...]
==================================================================

Above issue may happen as follows:
-------------------------------------
ext4_fill_super
  ext4_orphan_cleanup
   --- loop1: assume last_orphan is 12 ---
    list_add(&EXT4_I(inode)->i_orphan, &EXT4_SB(sb)->s_orphan)
    ext4_truncate --> return 0
      ext4_inode_attach_jinode --> return -ENOMEM
    iput(inode) --> free inode<12>
   --- loop2: last_orphan is still 12 ---
    list_add(&EXT4_I(inode)->i_orphan, &EXT4_SB(sb)->s_orphan);
    // use inode<12> and trigger UAF

To solve this issue, we need to propagate the return value of
ext4_inode_attach_jinode() appropriately.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221102080633.1630225-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:00 +01:00
Alexander Potapenko
970bfd7a41 fs: ext4: initialize fsdata in pagecache_write()
commit 956510c0c7439e90b8103aaeaf4da92878c622f0 upstream.

When aops->write_begin() does not initialize fsdata, KMSAN reports
an error passing the latter to aops->write_end().

Fix this by unconditionally initializing fsdata.

Cc: Eric Biggers <ebiggers@kernel.org>
Fixes: c93d8f885809 ("ext4: add basic fs-verity support")
Reported-by: syzbot+9767be679ef5016b6082@syzkaller.appspotmail.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221121112134.407362-1-glider@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:00 +01:00
Luís Henriques
744bbde378 ext4: remove trailing newline from ext4_msg() message
commit 78742d4d056df7d2fad241c90185d281bf924844 upstream.

The ext4_msg() function adds a new line to the message.  Remove extra '\n'
from call to ext4_msg() in ext4_orphan_cleanup().

Signed-off-by: Luís Henriques <lhenriques@suse.de>
Link: https://lore.kernel.org/r/20221011155758.15287-1-lhenriques@suse.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:00 +01:00
Baokun Li
7192afa5e4 ext4: add inode table check in __ext4_get_inode_loc to aovid possible infinite loop
commit eee22187b53611e173161e38f61de1c7ecbeb876 upstream.

In do_writepages, if the value returned by ext4_writepages is "-ENOMEM"
and "wbc->sync_mode == WB_SYNC_ALL", retry until the condition is not met.

In __ext4_get_inode_loc, if the bh returned by sb_getblk is NULL,
the function returns -ENOMEM.

In __getblk_slow, if the return value of grow_buffers is less than 0,
the function returns NULL.

When the three processes are connected in series like the following stack,
an infinite loop may occur:

do_writepages					<--- keep retrying
 ext4_writepages
  mpage_map_and_submit_extent
   mpage_map_one_extent
    ext4_map_blocks
     ext4_ext_map_blocks
      ext4_ext_handle_unwritten_extents
       ext4_ext_convert_to_initialized
        ext4_split_extent
         ext4_split_extent_at
          __ext4_ext_dirty
           __ext4_mark_inode_dirty
            ext4_reserve_inode_write
             ext4_get_inode_loc
              __ext4_get_inode_loc		<--- return -ENOMEM
               sb_getblk
                __getblk_gfp
                 __getblk_slow			<--- return NULL
                  grow_buffers
                   grow_dev_page		<--- return -ENXIO
                    ret = (block < end_block) ? 1 : -ENXIO;

In this issue, bg_inode_table_hi is overwritten as an incorrect value.
As a result, `block < end_block` cannot be met in grow_dev_page.
Therefore, __ext4_get_inode_loc always returns '-ENOMEM' and do_writepages
keeps retrying. As a result, the writeback process is in the D state due
to an infinite loop.

Add a check on inode table block in the __ext4_get_inode_loc function by
referring to ext4_read_inode_bitmap to avoid this infinite loop.

Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220817132701.3015912-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:00 +01:00
Zhang Yi
0d041b7251 ext4: silence the warning when evicting inode with dioread_nolock
commit bc12ac98ea2e1b70adc6478c8b473a0003b659d3 upstream.

When evicting an inode with default dioread_nolock, it could be raced by
the unwritten extents converting kworker after writeback some new
allocated dirty blocks. It convert unwritten extents to written, the
extents could be merged to upper level and free extent blocks, so it
could mark the inode dirty again even this inode has been marked
I_FREEING. But the inode->i_io_list check and warning in
ext4_evict_inode() missing this corner case. Fortunately,
ext4_evict_inode() will wait all extents converting finished before this
check, so it will not lead to inode use-after-free problem, every thing
is OK besides this warning. The WARN_ON_ONCE was originally designed
for finding inode use-after-free issues in advance, but if we add
current dioread_nolock case in, it will become not quite useful, so fix
this warning by just remove this check.

 ======
 WARNING: CPU: 7 PID: 1092 at fs/ext4/inode.c:227
 ext4_evict_inode+0x875/0xc60
 ...
 RIP: 0010:ext4_evict_inode+0x875/0xc60
 ...
 Call Trace:
  <TASK>
  evict+0x11c/0x2b0
  iput+0x236/0x3a0
  do_unlinkat+0x1b4/0x490
  __x64_sys_unlinkat+0x4c/0xb0
  do_syscall_64+0x3b/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7fa933c1115b
 ======

rm                          kworker
                            ext4_end_io_end()
vfs_unlink()
 ext4_unlink()
                             ext4_convert_unwritten_io_end_vec()
                              ext4_convert_unwritten_extents()
                               ext4_map_blocks()
                                ext4_ext_map_blocks()
                                 ext4_ext_try_to_merge_up()
                                  __mark_inode_dirty()
                                   check !I_FREEING
                                   locked_inode_to_wb_and_lock_list()
 iput()
  iput_final()
   evict()
    ext4_evict_inode()
     truncate_inode_pages_final() //wait release io_end
                                    inode_io_list_move_locked()
                             ext4_release_io_end()
     trigger WARN_ON_ONCE()

Cc: stable@kernel.org
Fixes: ceff86fddae8 ("ext4: Avoid freeing inodes on dirty list")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220629112647.4141034-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:59:00 +01:00
Steve French
5eb8296d73 cifs: fix missing display of three mount options
commit 2bfd81043e944af0e52835ef6d9b41795af22341 upstream.

Three mount options: "tcpnodelay" and "noautotune" and "noblocksend"
were not displayed when passed in on cifs/smb3 mounts (e.g. displayed
in /proc/mounts e.g.).  No change to defaults so these are not
displayed if not specified on mount.

Cc: stable@vger.kernel.org
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:58:56 +01:00
Paulo Alcantara
cfa9f66f91 cifs: fix confusing debug message
commit a85ceafd41927e41a4103d228a993df7edd8823b upstream.

Since rc was initialised to -ENOMEM in cifs_get_smb_ses(), when an
existing smb session was found, free_xid() would be called and then
print

  CIFS: fs/cifs/connect.c: Existing tcp session with server found
  CIFS: fs/cifs/connect.c: VFS: in cifs_get_smb_ses as Xid: 44 with uid: 0
  CIFS: fs/cifs/connect.c: Existing smb sess found (status=1)
  CIFS: fs/cifs/connect.c: VFS: leaving cifs_get_smb_ses (xid = 44) rc = -12

Fix this by initialising rc to 0 and then let free_xid() print this
instead

  CIFS: fs/cifs/connect.c: Existing tcp session with server found
  CIFS: fs/cifs/connect.c: VFS: in cifs_get_smb_ses as Xid: 14 with uid: 0
  CIFS: fs/cifs/connect.c: Existing smb sess found (status=1)
  CIFS: fs/cifs/connect.c: VFS: leaving cifs_get_smb_ses (xid = 14) rc = 0

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:58:56 +01:00
Alexander Aring
b2120ed7fd fs: dlm: retry accept() until -EAGAIN or error returns
commit f0f4bb431bd543ed7bebbaea3ce326cfcd5388bc upstream.

This patch fixes a race if we get two times an socket data ready event
while the listen connection worker is queued. Currently it will be
served only once but we need to do it (in this case twice) until we hit
-EAGAIN which tells us there is no pending accept going on.

This patch wraps an do while loop until we receive a return value which
is different than 0 as it was done before commit d11ccd451b65 ("fs: dlm:
listen socket out of connection hash").

Cc: stable@vger.kernel.org
Fixes: d11ccd451b65 ("fs: dlm: listen socket out of connection hash")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:58:51 +01:00
Alexander Aring
5b4478615f fs: dlm: fix sock release if listen fails
commit 08ae0547e75ec3d062b6b6b9cf4830c730df68df upstream.

This patch fixes a double sock_release() call when the listen() is
called for the dlm lowcomms listen socket. The caller of
dlm_listen_for_all should never care about releasing the socket if
dlm_listen_for_all() fails, it's done now only once if listen() fails.

Cc: stable@vger.kernel.org
Fixes: 2dc6b1158c28 ("fs: dlm: introduce generic listen")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:58:51 +01:00
Boris Burkov
99590f29b2 btrfs: fix resolving backrefs for inline extent followed by prealloc
commit 560840afc3e63bbe5d9c5ef6b2ecf8f3589adff6 upstream.

If a file consists of an inline extent followed by a regular or prealloc
extent, then a legitimate attempt to resolve a logical address in the
non-inline region will result in add_all_parents reading the invalid
offset field of the inline extent. If the inline extent item is placed
in the leaf eb s.t. it is the first item, attempting to access the
offset field will not only be meaningless, it will go past the end of
the eb and cause this panic:

  [17.626048] BTRFS warning (device dm-2): bad eb member end: ptr 0x3fd4 start 30834688 member offset 16377 size 8
  [17.631693] general protection fault, probably for non-canonical address 0x5088000000000: 0000 [#1] SMP PTI
  [17.635041] CPU: 2 PID: 1267 Comm: btrfs Not tainted 5.12.0-07246-g75175d5adc74-dirty #199
  [17.637969] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [17.641995] RIP: 0010:btrfs_get_64+0xe7/0x110
  [17.649890] RSP: 0018:ffffc90001f73a08 EFLAGS: 00010202
  [17.651652] RAX: 0000000000000001 RBX: ffff88810c42d000 RCX: 0000000000000000
  [17.653921] RDX: 0005088000000000 RSI: ffffc90001f73a0f RDI: 0000000000000001
  [17.656174] RBP: 0000000000000ff9 R08: 0000000000000007 R09: c0000000fffeffff
  [17.658441] R10: ffffc90001f73790 R11: ffffc90001f73788 R12: ffff888106afe918
  [17.661070] R13: 0000000000003fd4 R14: 0000000000003f6f R15: cdcdcdcdcdcdcdcd
  [17.663617] FS:  00007f64e7627d80(0000) GS:ffff888237c80000(0000) knlGS:0000000000000000
  [17.666525] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [17.668664] CR2: 000055d4a39152e8 CR3: 000000010c596002 CR4: 0000000000770ee0
  [17.671253] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [17.673634] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [17.676034] PKRU: 55555554
  [17.677004] Call Trace:
  [17.677877]  add_all_parents+0x276/0x480
  [17.679325]  find_parent_nodes+0xfae/0x1590
  [17.680771]  btrfs_find_all_leafs+0x5e/0xa0
  [17.682217]  iterate_extent_inodes+0xce/0x260
  [17.683809]  ? btrfs_inode_flags_to_xflags+0x50/0x50
  [17.685597]  ? iterate_inodes_from_logical+0xa1/0xd0
  [17.687404]  iterate_inodes_from_logical+0xa1/0xd0
  [17.689121]  ? btrfs_inode_flags_to_xflags+0x50/0x50
  [17.691010]  btrfs_ioctl_logical_to_ino+0x131/0x190
  [17.692946]  btrfs_ioctl+0x104a/0x2f60
  [17.694384]  ? selinux_file_ioctl+0x182/0x220
  [17.695995]  ? __x64_sys_ioctl+0x84/0xc0
  [17.697394]  __x64_sys_ioctl+0x84/0xc0
  [17.698697]  do_syscall_64+0x33/0x40
  [17.700017]  entry_SYSCALL_64_after_hwframe+0x44/0xae
  [17.701753] RIP: 0033:0x7f64e72761b7
  [17.709355] RSP: 002b:00007ffefb067f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [17.712088] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f64e72761b7
  [17.714667] RDX: 00007ffefb067fb0 RSI: 00000000c0389424 RDI: 0000000000000003
  [17.717386] RBP: 00007ffefb06d188 R08: 000055d4a390d2b0 R09: 00007f64e7340a60
  [17.719938] R10: 0000000000000231 R11: 0000000000000246 R12: 0000000000000001
  [17.722383] R13: 0000000000000000 R14: 00000000c0389424 R15: 000055d4a38fd2a0
  [17.724839] Modules linked in:

Fix the bug by detecting the inline extent item in add_all_parents and
skipping to the next extent item.

CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:58:50 +01:00
Jaegeuk Kim
7eb57bc92f f2fs: allow to read node block after shutdown
commit e6ecb142429183cef4835f31d4134050ae660032 upstream.

If block address is still alive, we should give a valid node block even after
shutdown. Otherwise, we can see zero data when reading out a file.

Cc: stable@vger.kernel.org
Fixes: 83a3bfdb5a8a ("f2fs: indicate shutdown f2fs to allow unmount successfully")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:58:47 +01:00
Pavel Machek
acc13987fd f2fs: should put a page when checking the summary info
commit c3db3c2fd9992c08f49aa93752d3c103c3a4f6aa upstream.

The commit introduces another bug.

Cc: stable@vger.kernel.org
Fixes: c6ad7fd16657e ("f2fs: fix to do sanity check on summary info")
Signed-off-by: Pavel Machek <pavel@denx.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:58:47 +01:00
Christian Brauner
c24cc476ac pnode: terminate at peers of source
commit 11933cf1d91d57da9e5c53822a540bbdc2656c16 upstream.

The propagate_mnt() function handles mount propagation when creating
mounts and propagates the source mount tree @source_mnt to all
applicable nodes of the destination propagation mount tree headed by
@dest_mnt.

Unfortunately it contains a bug where it fails to terminate at peers of
@source_mnt when looking up copies of the source mount that become
masters for copies of the source mount tree mounted on top of slaves in
the destination propagation tree causing a NULL dereference.

Once the mechanics of the bug are understood it's easy to trigger.
Because of unprivileged user namespaces it is available to unprivileged
users.

While fixing this bug we've gotten confused multiple times due to
unclear terminology or missing concepts. So let's start this with some
clarifications:

* The terms "master" or "peer" denote a shared mount. A shared mount
  belongs to a peer group.

* A peer group is a set of shared mounts that propagate to each other.
  They are identified by a peer group id. The peer group id is available
  in @shared_mnt->mnt_group_id.
  Shared mounts within the same peer group have the same peer group id.
  The peers in a peer group can be reached via @shared_mnt->mnt_share.

* The terms "slave mount" or "dependent mount" denote a mount that
  receives propagation from a peer in a peer group. IOW, shared mounts
  may have slave mounts and slave mounts have shared mounts as their
  master. Slave mounts of a given peer in a peer group are listed on
  that peers slave list available at @shared_mnt->mnt_slave_list.

* The term "master mount" denotes a mount in a peer group. IOW, it
  denotes a shared mount or a peer mount in a peer group. The term
  "master mount" - or "master" for short - is mostly used when talking
  in the context of slave mounts that receive propagation from a master
  mount. A master mount of a slave identifies the closest peer group a
  slave mount receives propagation from. The master mount of a slave can
  be identified via @slave_mount->mnt_master. Different slaves may point
  to different masters in the same peer group.

* Multiple peers in a peer group can have non-empty ->mnt_slave_lists.
  Non-empty ->mnt_slave_lists of peers don't intersect. Consequently, to
  ensure all slave mounts of a peer group are visited the
  ->mnt_slave_lists of all peers in a peer group have to be walked.

* Slave mounts point to a peer in the closest peer group they receive
  propagation from via @slave_mnt->mnt_master (see above). Together with
  these peers they form a propagation group (see below). The closest
  peer group can thus be identified through the peer group id
  @slave_mnt->mnt_master->mnt_group_id of the peer/master that a slave
  mount receives propagation from.

* A shared-slave mount is a slave mount to a peer group pg1 while also
  a peer in another peer group pg2. IOW, a peer group may receive
  propagation from another peer group.

  If a peer group pg1 is a slave to another peer group pg2 then all
  peers in peer group pg1 point to the same peer in peer group pg2 via
  ->mnt_master. IOW, all peers in peer group pg1 appear on the same
  ->mnt_slave_list. IOW, they cannot be slaves to different peer groups.

* A pure slave mount is a slave mount that is a slave to a peer group
  but is not a peer in another peer group.

* A propagation group denotes the set of mounts consisting of a single
  peer group pg1 and all slave mounts and shared-slave mounts that point
  to a peer in that peer group via ->mnt_master. IOW, all slave mounts
  such that @slave_mnt->mnt_master->mnt_group_id is equal to
  @shared_mnt->mnt_group_id.

  The concept of a propagation group makes it easier to talk about a
  single propagation level in a propagation tree.

  For example, in propagate_mnt() the immediate peers of @dest_mnt and
  all slaves of @dest_mnt's peer group form a propagation group propg1.
  So a shared-slave mount that is a slave in propg1 and that is a peer
  in another peer group pg2 forms another propagation group propg2
  together with all slaves that point to that shared-slave mount in
  their ->mnt_master.

* A propagation tree refers to all mounts that receive propagation
  starting from a specific shared mount.

  For example, for propagate_mnt() @dest_mnt is the start of a
  propagation tree. The propagation tree ecompasses all mounts that
  receive propagation from @dest_mnt's peer group down to the leafs.

With that out of the way let's get to the actual algorithm.

We know that @dest_mnt is guaranteed to be a pure shared mount or a
shared-slave mount. This is guaranteed by a check in
attach_recursive_mnt(). So propagate_mnt() will first propagate the
source mount tree to all peers in @dest_mnt's peer group:

for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) {
        ret = propagate_one(n);
        if (ret)
               goto out;
}

Notice, that the peer propagation loop of propagate_mnt() doesn't
propagate @dest_mnt itself. @dest_mnt is mounted directly in
attach_recursive_mnt() after we propagated to the destination
propagation tree.

The mount that will be mounted on top of @dest_mnt is @source_mnt. This
copy was created earlier even before we entered attach_recursive_mnt()
and doesn't concern us a lot here.

It's just important to notice that when propagate_mnt() is called
@source_mnt will not yet have been mounted on top of @dest_mnt. Thus,
@source_mnt->mnt_parent will either still point to @source_mnt or - in
the case @source_mnt is moved and thus already attached - still to its
former parent.

For each peer @m in @dest_mnt's peer group propagate_one() will create a
new copy of the source mount tree and mount that copy @child on @m such
that @child->mnt_parent points to @m after propagate_one() returns.

propagate_one() will stash the last destination propagation node @m in
@last_dest and the last copy it created for the source mount tree in
@last_source.

Hence, if we call into propagate_one() again for the next destination
propagation node @m, @last_dest will point to the previous destination
propagation node and @last_source will point to the previous copy of the
source mount tree and mounted on @last_dest.

Each new copy of the source mount tree is created from the previous copy
of the source mount tree. This will become important later.

The peer loop in propagate_mnt() is straightforward. We iterate through
the peers copying and updating @last_source and @last_dest as we go
through them and mount each copy of the source mount tree @child on a
peer @m in @dest_mnt's peer group.

After propagate_mnt() handled the peers in @dest_mnt's peer group
propagate_mnt() will propagate the source mount tree down the
propagation tree that @dest_mnt's peer group propagates to:

for (m = next_group(dest_mnt, dest_mnt); m;
                m = next_group(m, dest_mnt)) {
        /* everything in that slave group */
        n = m;
        do {
                ret = propagate_one(n);
                if (ret)
                        goto out;
                n = next_peer(n);
        } while (n != m);
}

The next_group() helper will recursively walk the destination
propagation tree, descending into each propagation group of the
propagation tree.

The important part is that it takes care to propagate the source mount
tree to all peers in the peer group of a propagation group before it
propagates to the slaves to those peers in the propagation group. IOW,
it creates and mounts copies of the source mount tree that become
masters before it creates and mounts copies of the source mount tree
that become slaves to these masters.

It is important to remember that propagating the source mount tree to
each mount @m in the destination propagation tree simply means that we
create and mount new copies @child of the source mount tree on @m such
that @child->mnt_parent points to @m.

Since we know that each node @m in the destination propagation tree
headed by @dest_mnt's peer group will be overmounted with a copy of the
source mount tree and since we know that the propagation properties of
each copy of the source mount tree we create and mount at @m will mostly
mirror the propagation properties of @m. We can use that information to
create and mount the copies of the source mount tree that become masters
before their slaves.

The easy case is always when @m and @last_dest are peers in a peer group
of a given propagation group. In that case we know that we can simply
copy @last_source without having to figure out what the master for the
new copy @child of the source mount tree needs to be as we've done that
in a previous call to propagate_one().

The hard case is when we're dealing with a slave mount or a shared-slave
mount @m in a destination propagation group that we need to create and
mount a copy of the source mount tree on.

For each propagation group in the destination propagation tree we
propagate the source mount tree to we want to make sure that the copies
@child of the source mount tree we create and mount on slaves @m pick an
ealier copy of the source mount tree that we mounted on a master @m of
the destination propagation group as their master. This is a mouthful
but as far as we can tell that's the core of it all.

But, if we keep track of the masters in the destination propagation tree
@m we can use the information to find the correct master for each copy
of the source mount tree we create and mount at the slaves in the
destination propagation tree @m.

Let's walk through the base case as that's still fairly easy to grasp.

If we're dealing with the first slave in the propagation group that
@dest_mnt is in then we don't yet have marked any masters in the
destination propagation tree.

We know the master for the first slave to @dest_mnt's peer group is
simple @dest_mnt. So we expect this algorithm to yield a copy of the
source mount tree that was mounted on a peer in @dest_mnt's peer group
as the master for the copy of the source mount tree we want to mount at
the first slave @m:

for (n = m; ; n = p) {
        p = n->mnt_master;
        if (p == dest_master || IS_MNT_MARKED(p))
                break;
}

For the first slave we walk the destination propagation tree all the way
up to a peer in @dest_mnt's peer group. IOW, the propagation hierarchy
can be walked by walking up the @mnt->mnt_master hierarchy of the
destination propagation tree @m. We will ultimately find a peer in
@dest_mnt's peer group and thus ultimately @dest_mnt->mnt_master.

Btw, here the assumption we listed at the beginning becomes important.
Namely, that peers in a peer group pg1 that are slaves in another peer
group pg2 appear on the same ->mnt_slave_list. IOW, all slaves who are
peers in peer group pg1 point to the same peer in peer group pg2 via
their ->mnt_master. Otherwise the termination condition in the code
above would be wrong and next_group() would be broken too.

So the first iteration sets:

n = m;
p = n->mnt_master;

such that @p now points to a peer or @dest_mnt itself. We walk up one
more level since we don't have any marked mounts. So we end up with:

n = dest_mnt;
p = dest_mnt->mnt_master;

If @dest_mnt's peer group is not slave to another peer group then @p is
now NULL. If @dest_mnt's peer group is a slave to another peer group
then @p now points to @dest_mnt->mnt_master points which is a master
outside the propagation tree we're dealing with.

Now we need to figure out the master for the copy of the source mount
tree we're about to create and mount on the first slave of @dest_mnt's
peer group:

do {
        struct mount *parent = last_source->mnt_parent;
        if (last_source == first_source)
                break;
        done = parent->mnt_master == p;
        if (done && peers(n, parent))
                break;
        last_source = last_source->mnt_master;
} while (!done);

We know that @last_source->mnt_parent points to @last_dest and
@last_dest is the last peer in @dest_mnt's peer group we propagated to
in the peer loop in propagate_mnt().

Consequently, @last_source is the last copy we created and mount on that
last peer in @dest_mnt's peer group. So @last_source is the master we
want to pick.

We know that @last_source->mnt_parent->mnt_master points to
@last_dest->mnt_master. We also know that @last_dest->mnt_master is
either NULL or points to a master outside of the destination propagation
tree and so does @p. Hence:

done = parent->mnt_master == p;

is trivially true in the base condition.

We also know that for the first slave mount of @dest_mnt's peer group
that @last_dest either points @dest_mnt itself because it was
initialized to:

last_dest = dest_mnt;

at the beginning of propagate_mnt() or it will point to a peer of
@dest_mnt in its peer group. In both cases it is guaranteed that on the
first iteration @n and @parent are peers (Please note the check for
peers here as that's important.):

if (done && peers(n, parent))
        break;

So, as we expected, we select @last_source, which referes to the last
copy of the source mount tree we mounted on the last peer in @dest_mnt's
peer group, as the master of the first slave in @dest_mnt's peer group.
The rest is taken care of by clone_mnt(last_source, ...). We'll skip
over that part otherwise this becomes a blogpost.

At the end of propagate_mnt() we now mark @m->mnt_master as the first
master in the destination propagation tree that is distinct from
@dest_mnt->mnt_master. IOW, we mark @dest_mnt itself as a master.

By marking @dest_mnt or one of it's peers we are able to easily find it
again when we later lookup masters for other copies of the source mount
tree we mount copies of the source mount tree on slaves @m to
@dest_mnt's peer group. This, in turn allows us to find the master we
selected for the copies of the source mount tree we mounted on master in
the destination propagation tree again.

The important part is to realize that the code makes use of the fact
that the last copy of the source mount tree stashed in @last_source was
mounted on top of the previous destination propagation node @last_dest.
What this means is that @last_source allows us to walk the destination
propagation hierarchy the same way each destination propagation node @m
does.

If we take @last_source, which is the copy of @source_mnt we have
mounted on @last_dest in the previous iteration of propagate_one(), then
we know @last_source->mnt_parent points to @last_dest but we also know
that as we walk through the destination propagation tree that
@last_source->mnt_master will point to an earlier copy of the source
mount tree we mounted one an earlier destination propagation node @m.

IOW, @last_source->mnt_parent will be our hook into the destination
propagation tree and each consecutive @last_source->mnt_master will lead
us to an earlier propagation node @m via
@last_source->mnt_master->mnt_parent.

Hence, by walking up @last_source->mnt_master, each of which is mounted
on a node that is a master @m in the destination propagation tree we can
also walk up the destination propagation hierarchy.

So, for each new destination propagation node @m we use the previous
copy of @last_source and the fact it's mounted on the previous
propagation node @last_dest via @last_source->mnt_master->mnt_parent to
determine what the master of the new copy of @last_source needs to be.

The goal is to find the _closest_ master that the new copy of the source
mount tree we are about to create and mount on a slave @m in the
destination propagation tree needs to pick. IOW, we want to find a
suitable master in the propagation group.

As the propagation structure of the source mount propagation tree we
create mirrors the propagation structure of the destination propagation
tree we can find @m's closest master - i.e., a marked master - which is
a peer in the closest peer group that @m receives propagation from. We
store that closest master of @m in @p as before and record the slave to
that master in @n

We then search for this master @p via @last_source by walking up the
master hierarchy starting from the last copy of the source mount tree
stored in @last_source that we created and mounted on the previous
destination propagation node @m.

We will try to find the master by walking @last_source->mnt_master and
by comparing @last_source->mnt_master->mnt_parent->mnt_master to @p. If
we find @p then we can figure out what earlier copy of the source mount
tree needs to be the master for the new copy of the source mount tree
we're about to create and mount at the current destination propagation
node @m.

If @last_source->mnt_master->mnt_parent and @n are peers then we know
that the closest master they receive propagation from is
@last_source->mnt_master->mnt_parent->mnt_master. If not then the
closest immediate peer group that they receive propagation from must be
one level higher up.

This builds on the earlier clarification at the beginning that all peers
in a peer group which are slaves of other peer groups all point to the
same ->mnt_master, i.e., appear on the same ->mnt_slave_list, of the
closest peer group that they receive propagation from.

However, terminating the walk has corner cases.

If the closest marked master for a given destination node @m cannot be
found by walking up the master hierarchy via @last_source->mnt_master
then we need to terminate the walk when we encounter @source_mnt again.

This isn't an arbitrary termination. It simply means that the new copy
of the source mount tree we're about to create has a copy of the source
mount tree we created and mounted on a peer in @dest_mnt's peer group as
its master. IOW, @source_mnt is the peer in the closest peer group that
the new copy of the source mount tree receives propagation from.

We absolutely have to stop @source_mnt because @last_source->mnt_master
either points outside the propagation hierarchy we're dealing with or it
is NULL because @source_mnt isn't a shared-slave.

So continuing the walk past @source_mnt would cause a NULL dereference
via @last_source->mnt_master->mnt_parent. And so we have to stop the
walk when we encounter @source_mnt again.

One scenario where this can happen is when we first handled a series of
slaves of @dest_mnt's peer group and then encounter peers in a new peer
group that is a slave to @dest_mnt's peer group. We handle them and then
we encounter another slave mount to @dest_mnt that is a pure slave to
@dest_mnt's peer group. That pure slave will have a peer in @dest_mnt's
peer group as its master. Consequently, the new copy of the source mount
tree will need to have @source_mnt as it's master. So we walk the
propagation hierarchy all the way up to @source_mnt based on
@last_source->mnt_master.

So terminate on @source_mnt, easy peasy. Except, that the check misses
something that the rest of the algorithm already handles.

If @dest_mnt has peers in it's peer group the peer loop in
propagate_mnt():

for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) {
        ret = propagate_one(n);
        if (ret)
                goto out;
}

will consecutively update @last_source with each previous copy of the
source mount tree we created and mounted at the previous peer in
@dest_mnt's peer group. So after that loop terminates @last_source will
point to whatever copy of the source mount tree was created and mounted
on the last peer in @dest_mnt's peer group.

Furthermore, if there is even a single additional peer in @dest_mnt's
peer group then @last_source will __not__ point to @source_mnt anymore.
Because, as we mentioned above, @dest_mnt isn't even handled in this
loop but directly in attach_recursive_mnt(). So it can't even accidently
come last in that peer loop.

So the first time we handle a slave mount @m of @dest_mnt's peer group
the copy of the source mount tree we create will make the __last copy of
the source mount tree we created and mounted on the last peer in
@dest_mnt's peer group the master of the new copy of the source mount
tree we create and mount on the first slave of @dest_mnt's peer group__.

But this means that the termination condition that checks for
@source_mnt is wrong. The @source_mnt cannot be found anymore by
propagate_one(). Instead it will find the last copy of the source mount
tree we created and mounted for the last peer of @dest_mnt's peer group
again. And that is a peer of @source_mnt not @source_mnt itself.

IOW, we fail to terminate the loop correctly and ultimately dereference
@last_source->mnt_master->mnt_parent. When @source_mnt's peer group
isn't slave to another peer group then @last_source->mnt_master is NULL
causing the splat below.

For example, assume @dest_mnt is a pure shared mount and has three peers
in its peer group:

===================================================================================
                                         mount-id   mount-parent-id   peer-group-id
===================================================================================
(@dest_mnt) mnt_master[216]              309        297               shared:216
    \
     (@source_mnt) mnt_master[218]:      609        609               shared:218

(1) mnt_master[216]:                     607        605               shared:216
    \
     (P1) mnt_master[218]:               624        607               shared:218

(2) mnt_master[216]:                     576        574               shared:216
    \
     (P2) mnt_master[218]:               625        576               shared:218

(3) mnt_master[216]:                     545        543               shared:216
    \
     (P3) mnt_master[218]:               626        545               shared:218

After this sequence has been processed @last_source will point to (P3),
the copy generated for the third peer in @dest_mnt's peer group we
handled. So the copy of the source mount tree (P4) we create and mount
on the first slave of @dest_mnt's peer group:

===================================================================================
                                         mount-id   mount-parent-id   peer-group-id
===================================================================================
    mnt_master[216]                      309        297               shared:216
   /
  /
(S0) mnt_slave                           483        481               master:216
  \
   \    (P3) mnt_master[218]             626        545               shared:218
    \  /
     \/
    (P4) mnt_slave                       627        483               master:218

will pick the last copy of the source mount tree (P3) as master, not (S0).

When walking the propagation hierarchy via @last_source's master
hierarchy we encounter (P3) but not (S0), i.e., @source_mnt.

We can fix this in multiple ways:

(1) By setting @last_source to @source_mnt after we processed the peers
    in @dest_mnt's peer group right after the peer loop in
    propagate_mnt().

(2) By changing the termination condition that relies on finding exactly
    @source_mnt to finding a peer of @source_mnt.

(3) By only moving @last_source when we actually venture into a new peer
    group or some clever variant thereof.

The first two options are minimally invasive and what we want as a fix.
The third option is more intrusive but something we'd like to explore in
the near future.

This passes all LTP tests and specifically the mount propagation
testsuite part of it. It also holds up against all known reproducers of
this issues.

Final words.
First, this is a clever but __worringly__ underdocumented algorithm.
There isn't a single detailed comment to be found in next_group(),
propagate_one() or anywhere else in that file for that matter. This has
been a giant pain to understand and work through and a bug like this is
insanely difficult to fix without a detailed understanding of what's
happening. Let's not talk about the amount of time that was sunk into
fixing this.

Second, all the cool kids with access to
unshare --mount --user --map-root --propagation=unchanged
are going to have a lot of fun. IOW, triggerable by unprivileged users
while namespace_lock() lock is held.

[  115.848393] BUG: kernel NULL pointer dereference, address: 0000000000000010
[  115.848967] #PF: supervisor read access in kernel mode
[  115.849386] #PF: error_code(0x0000) - not-present page
[  115.849803] PGD 0 P4D 0
[  115.850012] Oops: 0000 [#1] PREEMPT SMP PTI
[  115.850354] CPU: 0 PID: 15591 Comm: mount Not tainted 6.1.0-rc7 #3
[  115.850851] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
VirtualBox 12/01/2006
[  115.851510] RIP: 0010:propagate_one.part.0+0x7f/0x1a0
[  115.851924] Code: 75 eb 4c 8b 05 c2 25 37 02 4c 89 ca 48 8b 4a 10
49 39 d0 74 1e 48 3b 81 e0 00 00 00 74 26 48 8b 92 e0 00 00 00 be 01
00 00 00 <48> 8b 4a 10 49 39 d0 75 e2 40 84 f6 74 38 4c 89 05 84 25 37
02 4d
[  115.853441] RSP: 0018:ffffb8d5443d7d50 EFLAGS: 00010282
[  115.853865] RAX: ffff8e4d87c41c80 RBX: ffff8e4d88ded780 RCX: ffff8e4da4333a00
[  115.854458] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8e4d88ded780
[  115.855044] RBP: ffff8e4d88ded780 R08: ffff8e4da4338000 R09: ffff8e4da43388c0
[  115.855693] R10: 0000000000000002 R11: ffffb8d540158000 R12: ffffb8d5443d7da8
[  115.856304] R13: ffff8e4d88ded780 R14: 0000000000000000 R15: 0000000000000000
[  115.856859] FS:  00007f92c90c9800(0000) GS:ffff8e4dfdc00000(0000)
knlGS:0000000000000000
[  115.857531] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  115.858006] CR2: 0000000000000010 CR3: 0000000022f4c002 CR4: 00000000000706f0
[  115.858598] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  115.859393] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  115.860099] Call Trace:
[  115.860358]  <TASK>
[  115.860535]  propagate_mnt+0x14d/0x190
[  115.860848]  attach_recursive_mnt+0x274/0x3e0
[  115.861212]  path_mount+0x8c8/0xa60
[  115.861503]  __x64_sys_mount+0xf6/0x140
[  115.861819]  do_syscall_64+0x5b/0x80
[  115.862117]  ? do_faccessat+0x123/0x250
[  115.862435]  ? syscall_exit_to_user_mode+0x17/0x40
[  115.862826]  ? do_syscall_64+0x67/0x80
[  115.863133]  ? syscall_exit_to_user_mode+0x17/0x40
[  115.863527]  ? do_syscall_64+0x67/0x80
[  115.863835]  ? do_syscall_64+0x67/0x80
[  115.864144]  ? do_syscall_64+0x67/0x80
[  115.864452]  ? exc_page_fault+0x70/0x170
[  115.864775]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  115.865187] RIP: 0033:0x7f92c92b0ebe
[  115.865480] Code: 48 8b 0d 75 4f 0c 00 f7 d8 64 89 01 48 83 c8 ff
c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 42 4f 0c 00 f7 d8 64 89
01 48
[  115.866984] RSP: 002b:00007fff000aa728 EFLAGS: 00000246 ORIG_RAX:
00000000000000a5
[  115.867607] RAX: ffffffffffffffda RBX: 000055a77888d6b0 RCX: 00007f92c92b0ebe
[  115.868240] RDX: 000055a77888d8e0 RSI: 000055a77888e6e0 RDI: 000055a77888e620
[  115.868823] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[  115.869403] R10: 0000000000001000 R11: 0000000000000246 R12: 000055a77888e620
[  115.869994] R13: 000055a77888d8e0 R14: 00000000ffffffff R15: 00007f92c93e4076
[  115.870581]  </TASK>
[  115.870763] Modules linked in: nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6
nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink qrtr snd_intel8x0
sunrpc snd_ac97_codec ac97_bus snd_pcm snd_timer intel_rapl_msr
intel_rapl_common snd vboxguest intel_powerclamp video rapl joydev
soundcore i2c_piix4 wmi fuse zram xfs vmwgfx crct10dif_pclmul
crc32_pclmul crc32c_intel polyval_clmulni polyval_generic
drm_ttm_helper ttm e1000 ghash_clmulni_intel serio_raw ata_generic
pata_acpi scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
[  115.875288] CR2: 0000000000000010
[  115.875641] ---[ end trace 0000000000000000 ]---
[  115.876135] RIP: 0010:propagate_one.part.0+0x7f/0x1a0
[  115.876551] Code: 75 eb 4c 8b 05 c2 25 37 02 4c 89 ca 48 8b 4a 10
49 39 d0 74 1e 48 3b 81 e0 00 00 00 74 26 48 8b 92 e0 00 00 00 be 01
00 00 00 <48> 8b 4a 10 49 39 d0 75 e2 40 84 f6 74 38 4c 89 05 84 25 37
02 4d
[  115.878086] RSP: 0018:ffffb8d5443d7d50 EFLAGS: 00010282
[  115.878511] RAX: ffff8e4d87c41c80 RBX: ffff8e4d88ded780 RCX: ffff8e4da4333a00
[  115.879128] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8e4d88ded780
[  115.879715] RBP: ffff8e4d88ded780 R08: ffff8e4da4338000 R09: ffff8e4da43388c0
[  115.880359] R10: 0000000000000002 R11: ffffb8d540158000 R12: ffffb8d5443d7da8
[  115.880962] R13: ffff8e4d88ded780 R14: 0000000000000000 R15: 0000000000000000
[  115.881548] FS:  00007f92c90c9800(0000) GS:ffff8e4dfdc00000(0000)
knlGS:0000000000000000
[  115.882234] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  115.882713] CR2: 0000000000000010 CR3: 0000000022f4c002 CR4: 00000000000706f0
[  115.883314] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  115.883966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Fixes: f2ebb3a921c1 ("smarter propagate_mnt()")
Fixes: 5ec0811d3037 ("propogate_mnt: Handle the first propogated copy being a slave")
Cc: <stable@vger.kernel.org>
Reported-by: Ditang Chen <ditang.c@gmail.com>
Signed-off-by: Seth Forshee (Digital Ocean) <sforshee@kernel.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:58:47 +01:00
Zhang Tianci
83c44f0ebf ovl: Use ovl mounter's fsuid and fsgid in ovl_link()
commit 5b0db51215e895a361bc63132caa7cca36a53d6a upstream.

There is a wrong case of link() on overlay:
  $ mkdir /lower /fuse /merge
  $ mount -t fuse /fuse
  $ mkdir /fuse/upper /fuse/work
  $ mount -t overlay /merge -o lowerdir=/lower,upperdir=/fuse/upper,\
    workdir=work
  $ touch /merge/file
  $ chown bin.bin /merge/file // the file's caller becomes "bin"
  $ ln /merge/file /merge/lnkfile

Then we will get an error(EACCES) because fuse daemon checks the link()'s
caller is "bin", it denied this request.

In the changing history of ovl_link(), there are two key commits:

The first is commit bb0d2b8ad296 ("ovl: fix sgid on directory") which
overrides the cred's fsuid/fsgid using the new inode. The new inode's
owner is initialized by inode_init_owner(), and inode->fsuid is
assigned to the current user. So the override fsuid becomes the
current user. We know link() is actually modifying the directory, so
the caller must have the MAY_WRITE permission on the directory. The
current caller may should have this permission. This is acceptable
to use the caller's fsuid.

The second is commit 51f7e52dc943 ("ovl: share inode for hard link")
which removed the inode creation in ovl_link(). This commit move
inode_init_owner() into ovl_create_object(), so the ovl_link() just
give the old inode to ovl_create_or_link(). Then the override fsuid
becomes the old inode's fsuid, neither the caller nor the overlay's
mounter! So this is incorrect.

Fix this bug by using ovl mounter's fsuid/fsgid to do underlying
fs's link().

Link: https://lore.kernel.org/all/20220817102952.xnvesg3a7rbv576x@wittgenstein/T
Link: https://lore.kernel.org/lkml/20220825130552.29587-1-zhangtianci.1997@bytedance.com/t
Signed-off-by: Zhang Tianci <zhangtianci.1997@bytedance.com>
Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Fixes: 51f7e52dc943 ("ovl: share inode for hard link")
Cc: <stable@vger.kernel.org> # v4.8
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12 11:58:46 +01:00