IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
commit f2e717d655040d632c9015f19aa4275f8b16e7f2 upstream.
RFC3530 notes that the 'dircount' field may be zero, in which case the
recommendation is to ignore it, and only enforce the 'maxcount' field.
In RFC5661, this recommendation to ignore a zero valued field becomes a
requirement.
Fixes: aee377644146 ("nfsd4: fix rd_dircount enforcement")
Cc: <stable@vger.kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1d625050c7c2dd877e108e382b8aaf1ae3cfe1f4 upstream.
init_nfsd() should not unregister pernet subsys if the register fails
but should instead unwind from the last successful operation which is
register_filesystem().
Unregistering a failed register_pernet_subsys() call can result in
a kernel GPF as revealed by programmatically injecting an error in
register_pernet_subsys().
Verified the fix handled failure gracefully with no lingering nfsd
entry in /proc/filesystems. This change was introduced by the commit
bd5ae9288d64 ("nfsd: register pernet ops last, unregister first"),
the original error handling logic was correct.
Fixes: bd5ae9288d64 ("nfsd: register pernet ops last, unregister first")
Cc: stable@vger.kernel.org
Signed-off-by: Patrick Ho <Patrick.Ho@netapp.com>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8c38b705b4f4ca4e7f9cc116141bc38391917c30 upstream.
silence nfscache allocation warnings with kvzalloc
Currently nfsd_reply_cache_init attempts hash table allocation through
kmalloc, and manually falls back to vzalloc if that fails. This makes
the code a little larger than needed, and creates a significant amount
of serial console spam if you have enough systems.
Switching to kvzalloc gets rid of the allocation warnings, and makes
the code a little cleaner too as a side effect.
Freeing of nn->drc_hashtbl is already done using kvfree currently.
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Cc: Krzysztof Olędzki <ole@ans.pl>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 372d1f3e1bfede719864d0d1fbf3146b1e638c88 ]
The ext2_error() function syncs the filesystem so it sleeps. The caller
is holding a spinlock so it's not allowed to sleep.
ext2_statfs() <- disables preempt
-> ext2_count_free_blocks()
-> ext2_get_group_desc()
Fix this by using WARN() to print an error message and a stack trace
instead of using ext2_error().
Link: https://lore.kernel.org/r/20210921203233.GA16529@kili
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 42cb447410d024e9d54139ae9c21ea132a8c384c upstream.
When ext4_htree_fill_tree() fails, ext4_dx_readdir() can run into an
infinite loop since if info->last_pos != ctx->pos this will reset the
directory scan and reread the failing entry. For example:
1. a dx_dir which has 3 block, block 0 as dx_root block, block 1/2 as
leaf block which own the ext4_dir_entry_2
2. block 1 read ok and call_filldir which will fill the dirent and update
the ctx->pos
3. block 2 read fail, but we has already fill some dirent, so we will
return back to userspace will a positive return val(see ksys_getdents64)
4. the second ext4_dx_readdir will reset the world since info->last_pos
!= ctx->pos, and will also init the curr_hash which pos to block 1
5. So we will read block1 too, and once block2 still read fail, we can
only fill one dirent because the hash of the entry in block1(besides
the last one) won't greater than curr_hash
6. this time, we forget update last_pos too since the read for block2
will fail, and since we has got the one entry, ksys_getdents64 can
return success
7. Latter we will trapped in a loop with step 4~6
Cc: stable@kernel.org
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210914111415.3921954-1-yangerkun@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6fed83957f21eff11c8496e9f24253b03d2bc1dc upstream.
When ext4_insert_delayed block receives and recovers from an error from
ext4_es_insert_delayed_block(), e.g., ENOMEM, it does not release the
space it has reserved for that block insertion as it should. One effect
of this bug is that s_dirtyclusters_counter is not decremented and
remains incorrectly elevated until the file system has been unmounted.
This can result in premature ENOSPC returns and apparent loss of free
space.
Another effect of this bug is that
/sys/fs/ext4/<dev>/delayed_allocation_blocks can remain non-zero even
after syncfs has been executed on the filesystem.
Besides, add check for s_dirtyclusters_counter when inode is going to be
evicted and freed. s_dirtyclusters_counter can still keep non-zero until
inode is written back in .evict_inode(), and thus the check is delayed
to .destroy_inode().
Fixes: 51865fda28e5 ("ext4: let ext4 maintain extent status tree")
Cc: stable@kernel.org
Suggested-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210823061358.84473-1-jefflexu@linux.alibaba.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 75ca6ad408f459f00b09a64f04c774559848c097 upstream.
We should use unsigned long long rather than loff_t to avoid
overflow in ext4_max_bitmap_size() for comparison before returning.
w/o this patch sbi->s_bitmap_maxbytes was becoming a negative
value due to overflow of upper_limit (with has_huge_files as true)
Below is a quick test to trigger it on a 64KB pagesize system.
sudo mkfs.ext4 -b 65536 -O ^has_extents,^64bit /dev/loop2
sudo mount /dev/loop2 /mnt
sudo echo "hello" > /mnt/hello -> This will error out with
"echo: write error: File too large"
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/594f409e2c543e90fd836b78188dfa5c575065ba.1622867594.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9b2f72cc0aa4bb444541bb87581c35b7508b37d3 upstream.
In commit b212921b13bd ("elf: don't use MAP_FIXED_NOREPLACE for elf
executable mappings") we still leave MAP_FIXED_NOREPLACE in place for
load_elf_interp.
Unfortunately, this will cause kernel to fail to start with:
1 (init): Uhuuh, elf segment at 00003ffff7ffd000 requested but the memory is mapped already
Failed to execute /init (error -17)
The reason is that the elf interpreter (ld.so) has overlapping segments.
readelf -l ld-2.31.so
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x000000000002c94c 0x000000000002c94c R E 0x10000
LOAD 0x000000000002dae0 0x000000000003dae0 0x000000000003dae0
0x00000000000021e8 0x0000000000002320 RW 0x10000
LOAD 0x000000000002fe00 0x000000000003fe00 0x000000000003fe00
0x00000000000011ac 0x0000000000001328 RW 0x10000
The reason for this problem is the same as described in commit
ad55eac74f20 ("elf: enforce MAP_FIXED on overlaying elf segments").
Not only executable binaries, elf interpreters (e.g. ld.so) can have
overlapping elf segments, so we better drop MAP_FIXED_NOREPLACE and go
back to MAP_FIXED in load_elf_interp.
Fixes: 4ed28639519c ("fs, elf: drop MAP_FIXED usage from elf_map")
Cc: <stable@vger.kernel.org> # v4.19
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chen Jingwen <chenjingwen6@huawei.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 80f6e3080bfcf865062a926817b3ca6c4a137a57 upstream.
If the file size is almost S64_MAX, the calculated number of Merkle tree
levels exceeds FS_VERITY_MAX_LEVELS, causing FS_IOC_ENABLE_VERITY to
fail. This is unintentional, since as the comment above the definition
of FS_VERITY_MAX_LEVELS states, it is enough for over U64_MAX bytes of
data using SHA-256 and 4K blocks. (Specifically, 4096*128**8 >= 2**64.)
The bug is actually that when the number of blocks in the first level is
calculated from i_size, there is a signed integer overflow due to i_size
being signed. Fix this by treating i_size as unsigned.
This was found by the new test "generic: test fs-verity EFBIG scenarios"
(https://lkml.kernel.org/r/b1d116cd4d0ea74b9cd86f349c672021e005a75c.1631558495.git.boris@bur.io).
This didn't affect ext4 or f2fs since those have a smaller maximum file
size, but it did affect btrfs which allows files up to S64_MAX bytes.
Reported-by: Boris Burkov <boris@bur.io>
Fixes: 3fda4c617e84 ("fs-verity: implement FS_IOC_ENABLE_VERITY ioctl")
Fixes: fd2d1acfcadf ("fs-verity: add the hook for file ->open()")
Cc: <stable@vger.kernel.org> # v5.4+
Reviewed-by: Boris Burkov <boris@bur.io>
Link: https://lore.kernel.org/r/20210916203424.113376-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d5f6545934c47e97c0b48a645418e877b452a992 upstream.
In commit b7213ffa0e58 ("qnx4: avoid stringop-overread errors") I tried
to teach gcc about how the directory entry structure can be two
different things depending on a status flag. It made the code clearer,
and it seemed to make gcc happy.
However, Arnd points to a gcc bug, where despite using two different
members of a union, gcc then gets confused, and uses the size of one of
the members to decide if a string overrun happens. And not necessarily
the rigth one.
End result: with some configurations, gcc-11 will still complain about
the source buffer size being overread:
fs/qnx4/dir.c: In function 'qnx4_readdir':
fs/qnx4/dir.c:76:32: error: 'strnlen' specified bound [16, 48] exceeds source size 1 [-Werror=stringop-overread]
76 | size = strnlen(name, size);
| ^~~~~~~~~~~~~~~~~~~
fs/qnx4/dir.c:26:22: note: source object declared here
26 | char de_name;
| ^~~~~~~
because gcc will get confused about which union member entry is actually
getting accessed, even when the source code is very clear about it. Gcc
internally will have combined two "redundant" pointers (pointing to
different union elements that are at the same offset), and takes the
size checking from one or the other - not necessarily the right one.
This is clearly a gcc bug, but we can work around it fairly easily. The
biggest thing here is the big honking comment about why we do what we
do.
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578#c6
Reported-and-tested-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit b7213ffa0e585feb1aee3e7173e965e66ee0abaa ]
The qnx4 directory entries are 64-byte blocks that have different
contents depending on the a status byte that is in the last byte of the
block.
In particular, a directory entry can be either a "link info" entry with
a 48-byte name and pointers to the real inode information, or an "inode
entry" with a smaller 16-byte name and the full inode information.
But the code was written to always just treat the directory name as if
it was part of that "inode entry", and just extend the name to the
longer case if the status byte said it was a link entry.
That work just fine and gives the right results, but now that gcc is
tracking data structure accesses much more, the code can trigger a
compiler error about using up to 48 bytes (the long name) in a structure
that only has that shorter name in it:
fs/qnx4/dir.c: In function ‘qnx4_readdir’:
fs/qnx4/dir.c:51:32: error: ‘strnlen’ specified bound 48 exceeds source size 16 [-Werror=stringop-overread]
51 | size = strnlen(de->di_fname, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from fs/qnx4/qnx4.h:3,
from fs/qnx4/dir.c:16:
include/uapi/linux/qnx4_fs.h:45:25: note: source object declared here
45 | char di_fname[QNX4_SHORT_NAME_MAX];
| ^~~~~~~~
which is because the source code doesn't really make this whole "one of
two different types" explicit.
Fix this by introducing a very explicit union of the two types, and
basically explaining to the compiler what is really going on.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e946d3c887a9dc33aa82a349c6284f4a084163f4 ]
The problem is the mismatched types between "ctx->total_len" which is
an unsigned int, "rc" which is an int, and "ctx->rc" which is a
ssize_t. The code does:
ctx->rc = (rc == 0) ? ctx->total_len : rc;
We want "ctx->rc" to store the negative "rc" error code. But what
happens is that "rc" is type promoted to a high unsigned int and
'ctx->rc" will store the high positive value instead of a negative
value.
The fix is to change "rc" from an int to a ssize_t.
Fixes: c610c4b619e5 ("CIFS: Add asynchronous write support through kernel AIO")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 63d49d843ef5fffeea069e0ffdfbd2bf40ba01c6 ]
The AFS filesystem is currently triggering the silly-rename cleanup from
afs_d_revalidate() when it sees that a dentry has been changed by a third
party[1]. It should not be doing this as the cleanup includes deleting the
silly-rename target file on iput.
Fix this by removing the places in the d_revalidate handling that validate
anything other than the directory and the dirent. It probably should not
be looking to validate the target inode of the dentry also.
This includes removing the point in afs_d_revalidate() where the inode that
a dentry used to point to was marked as being deleted (AFS_VNODE_DELETED).
We don't know it got deleted. It could have been renamed or it could have
hard links remaining.
This was reproduced by cloning a git repo onto an afs volume on one
machine, switching to another machine and doing "git status", then
switching back to the first and doing "git status". The second status
would show weird output due to ".git/index" getting deleted by the above
mentioned mechanism.
A simpler way to do it is to do:
machine 1: touch a
machine 2: touch b; mv -f b a
machine 1: stat a
on an afs volume. The bug shows up as the stat failing with ENOENT and the
file server log showing that machine 1 deleted "a".
Fixes: 79ddbfa500b3 ("afs: Implement sillyrename for unlink and rename")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=214217#c4 [1]
Link: https://lore.kernel.org/r/163111668100.283156.3851669884664475428.stgit@warthog.procyon.org.uk/
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 0619b7901473c380abc05d45cf9c70bee0707db3 upstream.
It's not uncommon where __btrfs_dump_space_info() gets called
under over-commit situations.
In that case free space would underflow as total allocated space is not
enough to handle all the over-committed space.
Such underflow values can sometimes cause confusion for users enabled
enospc_debug mount option, and takes some seconds for developers to
convert the underflow value to signed result.
Just output the free space as s64 to avoid such problem.
Reported-by: Eli V <eliventer@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAJtFHUSy4zgyhf-4d9T+KdJp9w=UgzC2A0V=VtmaeEpcGgm1-Q@mail.gmail.com/
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9ed38fd4a15417cac83967360cf20b853bfab9b6 upstream.
Although very unlikely that the tlink pointer would be null in this case,
get_next_mid function can in theory return null (but not an error)
so need to check for null (not for IS_ERR, which can not be returned
here).
Address warning:
fs/smbfs_client/connect.c:2392 cifs_match_super()
warn: 'tlink' isn't an ERR_PTR
Pointed out by Dan Carpenter via smatch code analysis tool
CC: stable@vger.kernel.org
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9c0f0a03e386f4e1df33db676401547e1b7800c6 upstream.
ocfs2_data_convert_worker() is currently dropping any cached acl info
for FILE before down-converting meta lock. It should also drop for
DIRECTORY. Otherwise the second acl lookup returns the cached one (from
VFS layer) which could be already stale.
The problem we are seeing is that the acl changes on one node doesn't
get refreshed on other nodes in the following case:
Node 1 Node 2
-------------- ----------------
getfacl dir1
getfacl dir1 <-- this is OK
setfacl -m u:user1:rwX dir1
getfacl dir1 <-- see the change for user1
getfacl dir1 <-- can't see change for user1
Link: https://lkml.kernel.org/r/20210903012631.6099-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 17243e1c3072b8417a5ebfc53065d0a87af7ca77 ]
kobject_put() should be used to cleanup the memory associated with the
kobject instead of kobject_del(). See the section "Kobject removal" of
"Documentation/core-api/kobject.rst".
Link: https://lkml.kernel.org/r/20210629022556.3985106-7-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-7-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b2fe39c248f3fa4bbb2a20759b4fdd83504190f7 ]
If kobject_init_and_add returns with error, kobject_put() is needed here
to avoid memory leak, because kobject_init_and_add may return error
without freeing the memory associated with the kobject it allocated.
Link: https://lkml.kernel.org/r/20210629022556.3985106-6-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-6-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a3e181259ddd61fd378390977a1e4e2316853afa ]
The kobject_put() should be used to cleanup the memory associated with the
kobject instead of kobject_del. See the section "Kobject removal" of
"Documentation/core-api/kobject.rst".
Link: https://lkml.kernel.org/r/20210629022556.3985106-5-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-5-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 24f8cb1ed057c840728167dab33b32e44147c86f ]
If kobject_init_and_add return with error, kobject_put() is needed here to
avoid memory leak, because kobject_init_and_add may return error without
freeing the memory associated with the kobject it allocated.
Link: https://lkml.kernel.org/r/20210629022556.3985106-4-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-4-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit dbc6e7d44a514f231a64d9d5676e001b660b6448 ]
In nilfs_##name##_attr_release, kobj->parent should not be referenced
because it is a NULL pointer. The release() method of kobject is always
called in kobject_put(kobj), in the implementation of kobject_put(), the
kobj->parent will be assigned as NULL before call the release() method.
So just use kobj to get the subgroups, which is more efficient and can fix
a NULL pointer reference problem.
Link: https://lkml.kernel.org/r/20210629022556.3985106-3-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-3-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5f5dec07aca7067216ed4c1342e464e7307a9197 ]
Patch series "nilfs2: fix incorrect usage of kobject".
This patchset from Nanyong Sun fixes memory leak issues and a NULL
pointer dereference issue caused by incorrect usage of kboject in nilfs2
sysfs implementation.
This patch (of 6):
Reported by syzkaller:
BUG: memory leak
unreferenced object 0xffff888100ca8988 (size 8):
comm "syz-executor.1", pid 1930, jiffies 4294745569 (age 18.052s)
hex dump (first 8 bytes):
6c 6f 6f 70 31 00 ff ff loop1...
backtrace:
kstrdup+0x36/0x70 mm/util.c:60
kstrdup_const+0x35/0x60 mm/util.c:83
kvasprintf_const+0xf1/0x180 lib/kasprintf.c:48
kobject_set_name_vargs+0x56/0x150 lib/kobject.c:289
kobject_add_varg lib/kobject.c:384 [inline]
kobject_init_and_add+0xc9/0x150 lib/kobject.c:473
nilfs_sysfs_create_device_group+0x150/0x7d0 fs/nilfs2/sysfs.c:986
init_nilfs+0xa21/0xea0 fs/nilfs2/the_nilfs.c:637
nilfs_fill_super fs/nilfs2/super.c:1046 [inline]
nilfs_mount+0x7b4/0xe80 fs/nilfs2/super.c:1316
legacy_get_tree+0x105/0x210 fs/fs_context.c:592
vfs_get_tree+0x8e/0x2d0 fs/super.c:1498
do_new_mount fs/namespace.c:2905 [inline]
path_mount+0xf9b/0x1990 fs/namespace.c:3235
do_mount+0xea/0x100 fs/namespace.c:3248
__do_sys_mount fs/namespace.c:3456 [inline]
__se_sys_mount fs/namespace.c:3433 [inline]
__x64_sys_mount+0x14b/0x1f0 fs/namespace.c:3433
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
If kobject_init_and_add return with error, then the cleanup of kobject
is needed because memory may be allocated in kobject_init_and_add
without freeing.
And the place of cleanup_dev_kobject should use kobject_put to free the
memory associated with the kobject. As the section "Kobject removal" of
"Documentation/core-api/kobject.rst" says, kobject_del() just makes the
kobject "invisible", but it is not cleaned up. And no more cleanup will
do after cleanup_dev_kobject, so kobject_put is needed here.
Link: https://lkml.kernel.org/r/1625651306-10829-1-git-send-email-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/1625651306-10829-2-git-send-email-konishi.ryusuke@gmail.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Link: https://lkml.kernel.org/r/20210629022556.3985106-2-sunnanyong@huawei.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b11ed50346683a749632ea664959b28d524d7395 ]
The current code will update the mtime and then try to get caps to
handle the write. If we end up having to request caps from the MDS, then
the mtime in the cap grant will clobber the updated mtime and it'll be
lost.
This is most noticable when two clients are alternately writing to the
same file. Fw caps are continually being granted and revoked, and the
mtime ends up stuck because the updated mtimes are always being
overwritten with the old one.
Fix this by changing the order of operations in ceph_write_iter to get
the caps before updating the times. Also, make sure we check the pool
full conditions before even getting any caps or uninlining.
URL: https://tracker.ceph.com/issues/46574
Reported-by: Jozef Kováč <kovac@firma.zoznam.sk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 98e2e409e76ef7781d8511f997359e9c504a95c1 upstream.
When the refcount is decreased to 0, the resource reclamation branch is
entered. Before CPU0 reaches the race point (1), CPU1 may obtain the
spinlock and traverse the rbtree to find 'root', see
nilfs_lookup_root().
Although CPU1 will call refcount_inc() to increase the refcount, it is
obviously too late. CPU0 will release 'root' directly, CPU1 then
accesses 'root' and triggers UAF.
Use refcount_dec_and_lock() to ensure that both the operations of
decrease refcount to 0 and link deletion are lock protected eliminates
this risk.
CPU0 CPU1
nilfs_put_root():
<-------- (1)
spin_lock(&nilfs->ns_cptree_lock);
rb_erase(&root->rb_node, &nilfs->ns_cptree);
spin_unlock(&nilfs->ns_cptree_lock);
kfree(root);
<-------- use-after-free
refcount_t: underflow; use-after-free.
WARNING: CPU: 2 PID: 9476 at lib/refcount.c:28 \
refcount_warn_saturate+0x1cf/0x210 lib/refcount.c:28
Modules linked in:
CPU: 2 PID: 9476 Comm: syz-executor.0 Not tainted 5.10.45-rc1+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
RIP: 0010:refcount_warn_saturate+0x1cf/0x210 lib/refcount.c:28
... ...
Call Trace:
__refcount_sub_and_test include/linux/refcount.h:283 [inline]
__refcount_dec_and_test include/linux/refcount.h:315 [inline]
refcount_dec_and_test include/linux/refcount.h:333 [inline]
nilfs_put_root+0xc1/0xd0 fs/nilfs2/the_nilfs.c:795
nilfs_segctor_destroy fs/nilfs2/segment.c:2749 [inline]
nilfs_detach_log_writer+0x3fa/0x570 fs/nilfs2/segment.c:2812
nilfs_put_super+0x2f/0xf0 fs/nilfs2/super.c:467
generic_shutdown_super+0xcd/0x1f0 fs/super.c:464
kill_block_super+0x4a/0x90 fs/super.c:1446
deactivate_locked_super+0x6a/0xb0 fs/super.c:335
deactivate_super+0x85/0x90 fs/super.c:366
cleanup_mnt+0x277/0x2e0 fs/namespace.c:1118
__cleanup_mnt+0x15/0x20 fs/namespace.c:1125
task_work_run+0x8e/0x110 kernel/task_work.c:151
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_user_mode_loop kernel/entry/common.c:164 [inline]
exit_to_user_mode_prepare+0x13c/0x170 kernel/entry/common.c:191
syscall_exit_to_user_mode+0x16/0x30 kernel/entry/common.c:266
do_syscall_64+0x45/0x80 arch/x86/entry/common.c:56
entry_SYSCALL_64_after_hwframe+0x44/0xa9
There is no reproduction program, and the above is only theoretical
analysis.
Link: https://lkml.kernel.org/r/1629859428-5906-1-git-send-email-konishi.ryusuke@gmail.com
Fixes: ba65ae4729bf ("nilfs2: add checkpoint tree to nilfs object")
Link: https://lkml.kernel.org/r/20210723012317.4146-1-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e1e71c168813564be0f6ea3d6740a059ca42d177 ]
There is a potential race between fuse_read_interrupt() and
fuse_request_end().
TASK1
in fuse_read_interrupt(): delete req->intr_entry (while holding
fiq->lock)
TASK2
in fuse_request_end(): req->intr_entry is empty -> skip fiq->lock
wake up TASK3
TASK3
request is freed
TASK1
in fuse_read_interrupt(): dereference req->in.h.unique ***BAM***
Fix by always grabbing fiq->lock if the request was ever interrupted
(FR_INTERRUPTED set) thereby serializing with concurrent
fuse_read_interrupt() calls.
FR_INTERRUPTED is set before the request is queued on fiq->interrupts.
Dequeing the request is done with list_del_init() but FR_INTERRUPTED is not
cleared in this case.
Reported-by: lijiazi <lijiazi@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 6f93e834fa7c5faa0372e46828b4b2a966ac61d7 upstream.
The mount option max_inline ranges from 0 to the sectorsize (which is
now equal to page size). But we parse the mount options too early and
before the actual sectorsize is read from the superblock. So the upper
limit of max_inline is unaware of the actual sectorsize and is limited
by the temporary sectorsize 4096, even on a system where the default
sectorsize is 64K.
Fix this by reading the superblock sectorsize before the mount option
parse.
Reported-by: Alexander Tsvetkov <alexander.tsvetkov@oracle.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 52d5a0c6bd8a89f460243ed937856354f8f253a3 upstream.
If function ovl_instantiate() returns an error, ovl_cleanup will be called
and try to remove newdentry from wdir, but the newdentry has been moved to
udir at this time. This will causes BUG_ON(victim->d_parent->d_inode !=
dir) in fs/namei.c:may_delete.
Signed-off-by: chenying <chenying.kernel@bytedance.com>
Fixes: 01b39dcc9568 ("ovl: use inode_insert5() to hash a newly created inode")
Link: https://lore.kernel.org/linux-unionfs/e6496a94-a161-dc04-c38a-d2544633acb4@bytedance.com/
Cc: <stable@vger.kernel.org> # v4.18
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d72c74197b70bc3c95152f351a568007bffa3e11 ]
smb_buf is allocated by small_smb_init_no_tc(), and buf type is
CIFS_SMALL_BUFFER, so we should use cifs_small_buf_release() to
release it in failed path.
Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3736127a3aa805602b7a2ad60ec9cfce68065fbb ]
Function btrfs_lookup_data_extent calls btrfs_search_slot to verify if
the EXTENT_ITEM exists in the extent tree. btrfs_search_slot can return
values bellow zero if an error happened.
Function replay_one_extent currently checks if the search found
something (0 returned) and increments the reference, and if not, it
seems to evaluate as 'not found'.
Fix the condition by checking if the value was bellow zero and return
early.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7de875b231edb807387a81cde288aa9e1015ef9e ]
Locks have two sets of op arrays, fl_lmops for the lock manager (lockd
or nfsd), fl_ops for the filesystem. The server-side lockd code has
been setting its own fl_ops, which leads to confusion (and crashes) in
the reexport case, where the filesystem expects to be the only one
setting fl_ops.
And there's no reason for it that I can see-the lm_get/put_owner ops do
the same job.
Reported-by: Daire Byrne <daire@dneg.com>
Tested-by: Daire Byrne <daire@dneg.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d1340f80f0b8066321b499a376780da00560e857 ]
In the gfs2 withdraw sequence, the dlm protocol is unmounted with a call
to lm_unmount. After a withdraw, users are allowed to unmount the
withdrawn file system. But at that point we may still have glocks left
over that we need to free via unmount's call to gfs2_gl_hash_clear.
These glocks may have never been completed because of whatever problem
caused the withdraw (IO errors or whatever).
Before this patch, function gdlm_put_lock would still try to call into
dlm to unlock these leftover glocks, which resulted in dlm returning
-EINVAL because the lock space was abandoned. These glocks were never
freed because there was no mechanism after that to free them.
This patch adds a check to gdlm_put_lock to see if the locking protocol
was inactive (DFL_UNMOUNT flag) and if so, free the glock and not
make the invalid call into dlm.
I could have combined this "if" with the one that follows, related to
leftover glock LVBs, but I felt the code was more readable with its own
if clause.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 22e5fe2a2a279d9a6fcbdfb4dffe73821bef1c90 ]
userfaultfd assumes that the enabled features are set once and never
changed after UFFDIO_API ioctl succeeded.
However, currently, UFFDIO_API can be called concurrently from two
different threads, succeed on both threads and leave userfaultfd's
features in non-deterministic state. Theoretically, other uffd operations
(ioctl's and page-faults) can be dispatched while adversely affected by
such changes of features.
Moreover, the writes to ctx->state and ctx->features are not ordered,
which can - theoretically, again - let userfaultfd_ioctl() think that
userfaultfd API completed, while the features are still not initialized.
To avoid races, it is arguably best to get rid of ctx->state. Since there
are only 2 states, record the API initialization in ctx->features as the
uppermost bit and remove ctx->state.
Link: https://lkml.kernel.org/r/20210808020724.1022515-3-namit@vmware.com
Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c8dc3047c48540183744f959412d44b08c5435e1 ]
We need to unmap pages from userspace process before removing pagecache
in punch_hole() like we did in f2fs_setattr().
Similar change:
commit 5e44f8c374dc ("ext4: hole-punch use truncate_pagecache_range")
Fixes: fbfa2cc58d53 ("f2fs: add file operations")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit adf9ea89c719c1d23794e363f631e376b3ff8cbc ]
In below path, it will return ENOENT if filesystem is shutdown:
- f2fs_map_blocks
- f2fs_get_dnode_of_data
- f2fs_get_node_page
- __get_node_page
- read_node_page
- is_sbi_flag_set(sbi, SBI_IS_SHUTDOWN)
return -ENOENT
- force return value from ENOENT to 0
It should be fine for read case, since it indicates a hole condition,
and caller could use .m_next_pgofs to skip the hole and continue the
lookup.
However it may cause confusing for write case, since leaving a hole
there, and said nothing was wrong doesn't help.
There is at least one case from dax_iomap_actor() will complain that,
so fix this in prior to supporting dax in f2fs.
xfstest generic/388 reports below warning:
ubuntu godown: xfstests-induced forced shutdown of /mnt/scratch_f2fs:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 485833 at fs/dax.c:1127 dax_iomap_actor+0x339/0x370
Call Trace:
iomap_apply+0x1c4/0x7b0
? dax_iomap_rw+0x1c0/0x1c0
dax_iomap_rw+0xad/0x1c0
? dax_iomap_rw+0x1c0/0x1c0
f2fs_file_write_iter+0x5ab/0x970 [f2fs]
do_iter_readv_writev+0x273/0x2e0
do_iter_write+0xab/0x1f0
vfs_iter_write+0x21/0x40
iter_file_splice_write+0x287/0x540
do_splice+0x37c/0xa60
__x64_sys_splice+0x15f/0x3a0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
ubuntu godown: xfstests-induced forced shutdown of /mnt/scratch_f2fs:
------------[ cut here ]------------
RIP: 0010:dax_iomap_pte_fault.isra.0+0x72e/0x14a0
Call Trace:
dax_iomap_fault+0x44/0x70
f2fs_dax_huge_fault+0x155/0x400 [f2fs]
f2fs_dax_fault+0x18/0x30 [f2fs]
__do_fault+0x4e/0x120
do_fault+0x3cf/0x7a0
__handle_mm_fault+0xa8c/0xf20
? find_held_lock+0x39/0xd0
handle_mm_fault+0x1b6/0x480
do_user_addr_fault+0x320/0xcd0
? rcu_read_lock_sched_held+0x67/0xc0
exc_page_fault+0x77/0x3f0
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30
Fixes: 83a3bfdb5a8a ("f2fs: indicate shutdown f2fs to allow unmount successfully")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ad126ebddecbf696e0cf214ff56c7b170fa9f0f7 ]
There is a missing place we forgot to account .skipped_gc_rwsem, fix it.
Fixes: 6f8d4455060d ("f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 35b72573e977ed6b18b094136a4fa3e0ffb13603 ]
The current hash algorithm used for hashing cookie keys is really bad,
producing almost no dispersion (after a test kernel build, ~30000 files
were split over just 18 out of the 32768 hash buckets).
Borrow the full_name_hash() hash function into fscache to do the hashing
for cookie keys and, in the future, volume keys.
I don't want to use full_name_hash() as-is because I want the hash value to
be consistent across arches and over time as the hash value produced may
get used on disk.
I can also optimise parts of it away as the key will always be a padded
array of aligned 32-bit words.
Fixes: ec0328e46d6e ("fscache: Maintain a catalogue of allocated cookies")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/162431201844.2908479.8293647220901514696.stgit@warthog.procyon.org.uk/
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d4bf15a7ce172d186d400d606adf4f34a59130d6 ]
I recently found a case where de->name_len is 0 in f2fs_fill_dentries()
easily reproduced, and finally set the fsck flag.
Thread A Thread B
- f2fs_readdir
- f2fs_read_inline_dir
- ctx->pos = d.max
- f2fs_add_dentry
- f2fs_add_inline_entry
- do_convert_inline_dir
- f2fs_add_regular_entry
- f2fs_readdir
- f2fs_fill_dentries
- set_sbi_flag(sbi, SBI_NEED_FSCK)
Process A opens the folder, and has been reading without closing it.
During this period, Process B created a file under the folder (occupying
multiple f2fs_dir_entry, exceeding the d.max of the inline dir). After
creation, process A uses the d.max of inline dir to read it again, and
it will read that de->name_len is 0.
And Chao pointed out that w/o inline conversion, the race condition still
can happen as below:
dir_entry1: A
dir_entry2: B
dir_entry3: C
free slot: _
ctx->pos: ^
Thread A is traversing directory,
ctx-pos moves to below position after readdir() by thread A:
AAAABBBB___
^
Then thread B delete dir_entry2, and create dir_entry3.
Thread A calls readdir() to lookup dirents starting from middle
of new dirent slots as below:
AAAACCCCCC_
^
In these scenarios, the file system is not damaged, and it's hard to
avoid it. But we can bypass tagging FSCK flag if:
a) bit_pos (:= ctx->pos % d->max) is non-zero and
b) before bit_pos moves to first valid dir_entry.
Fixes: ddf06b753a85 ("f2fs: fix to trigger fsck if dirent.name_len is zero")
Signed-off-by: Yangtao Li <frank.li@vivo.com>
[Chao: clean up description]
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c45d6002ff7a322022560e9b19ad867b01fec77f ]
As Eric mentioned, bare printk{,_ratelimited} won't show which
filesystem instance these message is coming from, this patch tries
to show fs instance with sb->s_id field in all places we missed
before.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 0d977e0eba234e01a60bdde27314dc21374201b3 upstream.
This crash was observed with a failed assertion on device close:
BTRFS: Transaction aborted (error -28)
WARNING: CPU: 1 PID: 3902 at fs/btrfs/extent-tree.c:2150 btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
Modules linked in: btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
CPU: 1 PID: 3902 Comm: kworker/u8:4 Not tainted 5.14.0-rc5-default+ #1532
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
RIP: 0010:btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
RSP: 0018:ffffb7a5452d7d80 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
RBP: ffff97834176a378 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: ffff97835195d388
R13: 0000000005b08000 R14: ffff978385484000 R15: 000000000000016c
FS: 0000000000000000(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056190d003fe8 CR3: 000000002a81e005 CR4: 0000000000170ea0
Call Trace:
flush_space+0x197/0x2f0 [btrfs]
btrfs_async_reclaim_metadata_space+0x139/0x300 [btrfs]
process_one_work+0x262/0x5e0
worker_thread+0x4c/0x320
? process_one_work+0x5e0/0x5e0
kthread+0x144/0x170
? set_kthread_struct+0x40/0x40
ret_from_fork+0x1f/0x30
irq event stamp: 19334989
hardirqs last enabled at (19334997): [<ffffffffab0e0c87>] console_unlock+0x2b7/0x400
hardirqs last disabled at (19335006): [<ffffffffab0e0d0d>] console_unlock+0x33d/0x400
softirqs last enabled at (19334900): [<ffffffffaba0030d>] __do_softirq+0x30d/0x574
softirqs last disabled at (19334893): [<ffffffffab0721ec>] irq_exit_rcu+0x12c/0x140
---[ end trace 45939e308e0dd3c7 ]---
BTRFS: error (device vdd) in btrfs_run_delayed_refs:2150: errno=-28 No space left
BTRFS info (device vdd): forced readonly
BTRFS warning (device vdd): failed setting block group ro: -30
BTRFS info (device vdd): suspending dev_replace for unmount
assertion failed: !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state), in fs/btrfs/volumes.c:1150
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3431!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 3982 Comm: umount Tainted: G W 5.14.0-rc5-default+ #1532
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
RSP: 0018:ffffb7a5454c7db8 EFLAGS: 00010246
RAX: 0000000000000068 RBX: ffff978364b91c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
RBP: ffff9783523a4c00 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: ffff9783523a4d18
R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000003
FS: 00007f61c8f42800(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056190cffa810 CR3: 0000000030b96002 CR4: 0000000000170ea0
Call Trace:
btrfs_close_one_device.cold+0x11/0x55 [btrfs]
close_fs_devices+0x44/0xb0 [btrfs]
btrfs_close_devices+0x48/0x160 [btrfs]
generic_shutdown_super+0x69/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x2c/0xa0
cleanup_mnt+0x144/0x1b0
task_work_run+0x59/0xa0
exit_to_user_mode_loop+0xe7/0xf0
exit_to_user_mode_prepare+0xaf/0xf0
syscall_exit_to_user_mode+0x19/0x50
do_syscall_64+0x4a/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
This happens when close_ctree is called while a dev_replace hasn't
completed. In close_ctree, we suspend the dev_replace, but keep the
replace target around so that we can resume the dev_replace procedure
when we mount the root again. This is the call trace:
close_ctree():
btrfs_dev_replace_suspend_for_unmount();
btrfs_close_devices():
btrfs_close_fs_devices():
btrfs_close_one_device():
ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
&device->dev_state));
However, since the replace target sticks around, there is a device
with BTRFS_DEV_STATE_REPLACE_TGT set on close, and we fail the
assertion in btrfs_close_one_device.
To fix this, if we come across the replace target device when
closing, we should properly reset it back to allocation state. This
fix also ensures that if a non-target device has a corrupted state and
has the BTRFS_DEV_STATE_REPLACE_TGT bit set, the assertion will still
catch the error.
Reported-by: David Sterba <dsterba@suse.com>
Fixes: b2a616676839 ("btrfs: fix rw device counting in __btrfs_free_extra_devids")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ac98141d140444fe93e26471d3074c603b70e2ca upstream.
We use the async_delalloc_pages mechanism to make sure that we've
completed our async work before trying to continue our delalloc
flushing. The reason for this is we need to see any ordered extents
that were created by our delalloc flushing. However we're waking up
before we do the submit work, which is before we create the ordered
extents. This is a pretty wide race window where we could potentially
think there are no ordered extents and thus exit shrink_delalloc
prematurely. Fix this by waking us up after we've done the work to
create ordered extents.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 59bda8ecee2ffc6a602b7bf2b9e43ca669cdbdcd upstream.
Callers of fuse_writeback_range() assume that the file is ready for
modification by the server in the supplied byte range after the call
returns.
If there's a write that extends the file beyond the end of the supplied
range, then the file needs to be extended to at least the end of the range,
but currently that's not done.
There are at least two cases where this can cause problems:
- copy_file_range() will return short count if the file is not extended
up to end of the source range.
- FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE will not extend the file,
hence the region may not be fully allocated.
Fix by flushing writes from the start of the range up to the end of the
file. This could be optimized if the writes are non-extending, etc, but
it's probably not worth the trouble.
Fixes: a2bc92362941 ("fuse: fix copy_file_range() in the writeback case")
Fixes: 6b1bdb56b17c ("fuse: allow fallocate(FALLOC_FL_ZERO_RANGE)")
Cc: <stable@vger.kernel.org> # v5.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 76224355db7570cbe6b6f75c8929a1558828dd55 upstream.
fuse_finish_open() will be called with FUSE_NOWRITE in case of atomic
O_TRUNC. This can deadlock with fuse_wait_on_page_writeback() in
fuse_launder_page() triggered by invalidate_inode_pages2().
Fix by replacing invalidate_inode_pages2() in fuse_finish_open() with a
truncate_pagecache() call. This makes sense regardless of FOPEN_KEEP_CACHE
or fc->writeback cache, so do it unconditionally.
Reported-by: Xie Yongji <xieyongji@bytedance.com>
Reported-and-tested-by: syzbot+bea44a5189836d956894@syzkaller.appspotmail.com
Fixes: e4648309b85a ("fuse: truncate pending writes on O_TRUNC")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit f980d055a0f858d73d9467bb0b570721bbfcdfb8 ]
strlcpy() reads the entire source buffer first. This read may exceed the
destination size limit. This is both inefficient and can lead to linear
read overflows if a source string is not NUL-terminated.
Also, the strnlen() call does not avoid the read overflow in the strlcpy
function when a not NUL-terminated string is passed.
So, replace this block by a call to kstrndup() that avoids this type of
overflow and does the same.
Fixes: 066ce6899484d ("cifs: rename cifs_strlcpy_to_host and make it use new functions")
Signed-off-by: Len Baker <len.baker@gmx.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 36ca7943ac18aebf8aad4c50829eb2ea5ec847df ]
When the max pages (last_page in the swap header + 1) is smaller than
the total pages (inode size) of the swapfile, iomap_swapfile_activate
overwrites sis->max with total pages.
However, frontswap_map is a swap page state bitmap allocated using the
initial sis->max page count read from the swap header. If swapfile
activation increases sis->max, it's possible for the frontswap code to
walk off the end of the bitmap, thereby corrupting kernel memory.
[djwong: modify the description a bit; the original paragraph reads:
"However, frontswap_map is allocated using max pages. When test and clear
the sis offset, which is larger than max pages, of frontswap_map in
__frontswap_invalidate_page(), neighbors of frontswap_map may be
overwritten, i.e., slab is polluted."
Note also that this bug resulted in a behavioral change: activating a
swap file that was formatted and later extended results in all pages
being activated, not the number of pages recorded in the swap header.]
This fixes the issue by considering the limitation of max pages of swap
info in iomap_swapfile_add_extent().
To reproduce the case, compile kernel with slub RED ZONE, then run test:
$ sudo stress-ng -a 1 -x softlockup,resources -t 72h --metrics --times \
--verify -v -Y /root/tmpdir/stress-ng/stress-statistic-12.yaml \
--log-file /root/tmpdir/stress-ng/stress-logfile-12.txt \
--temp-path /root/tmpdir/stress-ng/
We'll get the error log as below:
[ 1151.015141] =============================================================================
[ 1151.016489] BUG kmalloc-16 (Not tainted): Right Redzone overwritten
[ 1151.017486] -----------------------------------------------------------------------------
[ 1151.017486]
[ 1151.018997] Disabling lock debugging due to kernel taint
[ 1151.019873] INFO: 0x0000000084e43932-0x0000000098d17cae @offset=7392. First byte 0x0 instead of 0xcc
[ 1151.021303] INFO: Allocated in __do_sys_swapon+0xcf6/0x1170 age=43417 cpu=9 pid=3816
[ 1151.022538] __slab_alloc+0xe/0x20
[ 1151.023069] __kmalloc_node+0xfd/0x4b0
[ 1151.023704] __do_sys_swapon+0xcf6/0x1170
[ 1151.024346] do_syscall_64+0x33/0x40
[ 1151.024925] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1151.025749] INFO: Freed in put_cred_rcu+0xa1/0xc0 age=43424 cpu=3 pid=2041
[ 1151.026889] kfree+0x276/0x2b0
[ 1151.027405] put_cred_rcu+0xa1/0xc0
[ 1151.027949] rcu_do_batch+0x17d/0x410
[ 1151.028566] rcu_core+0x14e/0x2b0
[ 1151.029084] __do_softirq+0x101/0x29e
[ 1151.029645] asm_call_irq_on_stack+0x12/0x20
[ 1151.030381] do_softirq_own_stack+0x37/0x40
[ 1151.031037] do_softirq.part.15+0x2b/0x30
[ 1151.031710] __local_bh_enable_ip+0x4b/0x50
[ 1151.032412] copy_fpstate_to_sigframe+0x111/0x360
[ 1151.033197] __setup_rt_frame+0xce/0x480
[ 1151.033809] arch_do_signal+0x1a3/0x250
[ 1151.034463] exit_to_user_mode_prepare+0xcf/0x110
[ 1151.035242] syscall_exit_to_user_mode+0x27/0x190
[ 1151.035970] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1151.036795] INFO: Slab 0x000000003b9de4dc objects=44 used=9 fp=0x00000000539e349e flags=0xfffffc0010201
[ 1151.038323] INFO: Object 0x000000004855ba01 @offset=7376 fp=0x0000000000000000
[ 1151.038323]
[ 1151.039683] Redzone 000000008d0afd3d: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ................
[ 1151.041180] Object 000000004855ba01: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1151.042714] Redzone 0000000084e43932: 00 00 00 c0 cc cc cc cc ........
[ 1151.044120] Padding 000000000864c042: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 1151.045615] CPU: 5 PID: 3816 Comm: stress-ng Tainted: G B 5.10.50+ #7
[ 1151.046846] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[ 1151.048633] Call Trace:
[ 1151.049072] dump_stack+0x57/0x6a
[ 1151.049585] check_bytes_and_report+0xed/0x110
[ 1151.050320] check_object+0x1eb/0x290
[ 1151.050924] ? __x64_sys_swapoff+0x39a/0x540
[ 1151.051646] free_debug_processing+0x151/0x350
[ 1151.052333] __slab_free+0x21a/0x3a0
[ 1151.052938] ? _cond_resched+0x2d/0x40
[ 1151.053529] ? __vunmap+0x1de/0x220
[ 1151.054139] ? __x64_sys_swapoff+0x39a/0x540
[ 1151.054796] ? kfree+0x276/0x2b0
[ 1151.055307] kfree+0x276/0x2b0
[ 1151.055832] __x64_sys_swapoff+0x39a/0x540
[ 1151.056466] do_syscall_64+0x33/0x40
[ 1151.057084] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1151.057866] RIP: 0033:0x150340b0ffb7
[ 1151.058481] Code: Unable to access opcode bytes at RIP 0x150340b0ff8d.
[ 1151.059537] RSP: 002b:00007fff7f4ee238 EFLAGS: 00000246 ORIG_RAX: 00000000000000a8
[ 1151.060768] RAX: ffffffffffffffda RBX: 00007fff7f4ee66c RCX: 0000150340b0ffb7
[ 1151.061904] RDX: 000000000000000a RSI: 0000000000018094 RDI: 00007fff7f4ee860
[ 1151.063033] RBP: 00007fff7f4ef980 R08: 0000000000000000 R09: 0000150340a672bd
[ 1151.064135] R10: 00007fff7f4edca0 R11: 0000000000000246 R12: 0000000000018094
[ 1151.065253] R13: 0000000000000005 R14: 000000000160d930 R15: 00007fff7f4ee66c
[ 1151.066413] FIX kmalloc-16: Restoring 0x0000000084e43932-0x0000000098d17cae=0xcc
[ 1151.066413]
[ 1151.067890] FIX kmalloc-16: Object at 0x000000004855ba01 not freed
Fixes: 67482129cdab ("iomap: add a swapfile activation function")
Fixes: a45c0eccc564 ("iomap: move the swapfile code into a separate file")
Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f7104cc1a9159cd0d3e8526cb638ae0301de4b61 ]
This should use the network-namespace-wide client_lock, not the
per-client cl_lock.
You shouldn't see any bugs unless you're actually using the
forced-expiry interface introduced by 89c905beccbb.
Fixes: 89c905beccbb "nfsd: allow forced expiration of NFSv4 clients"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>