IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
[ Upstream commit 9b56adcf525522e9ffa52471260298d91fc1d395 ]
When compressed file has blocks, f2fs_ioc_start_atomic_write will succeed,
but compressed flag will be remained in inode. If write partial compreseed
cluster and commit atomic write will cause data corruption.
This is the reproduction process:
Step 1:
create a compressed file ,write 64K data , call fsync(), then the blocks
are write as compressed cluster.
Step2:
iotcl(F2FS_IOC_START_ATOMIC_WRITE) --- this should be fail, but not.
write page 0 and page 3.
iotcl(F2FS_IOC_COMMIT_ATOMIC_WRITE) -- page 0 and 3 write as normal file,
Step3:
drop cache.
read page 0-4 -- Since page 0 has a valid block address, read as
non-compressed cluster, page 1 and 2 will be filled with compressed data
or zero.
The root cause is, after commit 7eab7a696827 ("f2fs: compress: remove
unneeded read when rewrite whole cluster"), in step 2, f2fs_write_begin()
only set target page dirty, and in f2fs_commit_inmem_pages(), we will write
partial raw pages into compressed cluster, result in corrupting compressed
cluster layout.
Fixes: 4c8ff7095bef ("f2fs: support data compression")
Fixes: 7eab7a696827 ("f2fs: compress: remove unneeded read when rewrite whole cluster")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7eab7a6968278c735b1ca6387056a408f7960265 ]
when we overwrite the whole page in cluster, we don't need read original
data before write, because after write_end(), writepages() can help to
load left data in that cluster.
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1f4613cdbe7739ce291554b316bff8e551383389 ]
When reflinking an inline extent, we assert that its file offset is 0 and
that its uncompressed length is not greater than the sector size. We then
return an error if one of those conditions is not satisfied. However we
use a return statement, which results in returning from btrfs_clone()
without freeing the path and buffer that were allocated before, as well as
not clearing the flag BTRFS_INODE_NO_DELALLOC_FLUSH for the destination
inode.
Fix that by jumping to the 'out' label instead, and also add a WARN_ON()
for each condition so that in case assertions are disabled, we get to
known which of the unexpected conditions triggered the error.
Fixes: a61e1e0df9f321 ("Btrfs: simplify inline extent handling when doing reflinks")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 344150999b7fc88502a65bbb147a47503eca2033 ]
Quoted from Jing Xia's report, there is a potential deadlock may happen
between kworker and checkpoint as below:
[T:writeback] [T:checkpoint]
- wb_writeback
- blk_start_plug
bio contains NodeA was plugged in writeback threads
- do_writepages -- sync write inodeB, inc wb_sync_req[DATA]
- f2fs_write_data_pages
- f2fs_write_single_data_page -- write last dirty page
- f2fs_do_write_data_page
- set_page_writeback -- clear page dirty flag and
PAGECACHE_TAG_DIRTY tag in radix tree
- f2fs_outplace_write_data
- f2fs_update_data_blkaddr
- f2fs_wait_on_page_writeback -- wait NodeA to writeback here
- inode_dec_dirty_pages
- writeback_sb_inodes
- writeback_single_inode
- do_writepages
- f2fs_write_data_pages -- skip writepages due to wb_sync_req[DATA]
- wbc->pages_skipped += get_dirty_pages() -- PAGECACHE_TAG_DIRTY is not set but get_dirty_pages() returns one
- requeue_inode -- requeue inode to wb->b_dirty queue due to non-zero.pages_skipped
- blk_finish_plug
Let's try to avoid deadlock condition by forcing unplugging previous bio via
blk_finish_plug(current->plug) once we'v skipped writeback in writepages()
due to valid sbi->wb_sync_req[DATA/NODE].
Fixes: 687de7f1010c ("f2fs: avoid IO split due to mixed WB_SYNC_ALL and WB_SYNC_NONE")
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Jing Xia <jing.xia@unisoc.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4d2eeafecd6c83b4444db3dc0ada201c89b1aa44 ]
The nfsd file cache table can be pretty large and its allocation
may require as many as 80 contigious pages.
Employ the same fix that was employed for similar issue that was
reported for the reply cache hash table allocation several years ago
by commit 8f97514b423a ("nfsd: more robust allocation failure handling
in nfsd_reply_cache_init").
Fixes: 65294c1f2c5e ("nfsd: add a new struct file caching facility to nfsd")
Link: https://lore.kernel.org/linux-nfs/e3cdaeec85a6cfec980e87fc294327c0381c1778.camel@kernel.org/
Suggested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7d19e3dab0002e527052b0aaf986e8c32e5537bf ]
It needs to assign sbi->gc_mode with GC_IDLE_AT rather than GC_AT when
user tries to enable ATGC via gc_idle sysfs interface, fix it.
Fixes: 093749e296e2 ("f2fs: support age threshold based garbage collection")
Cc: Zhipeng Tan <tanzhipeng@hust.edu.cn>
Signed-off-by: Jicheng Shao <shaojicheng@hust.edu.cn>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5e929367468c8f97cd1ffb0417316cecfebef94b ]
The fix for not advancing the iterator if we're using fixed buffers is
broken in that it can hit a condition where we don't terminate the loop.
This results in io-wq looping forever, asking to read (or write) 0 bytes
for every subsequent loop.
Reported-by: Joel Jaeschke <joel.jaeschke@gmail.com>
Link: https://github.com/axboe/liburing/issues/549
Fixes: 16c8d2df7ec0 ("io_uring: ensure symmetry in handling iter types in loop_rw_iter()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit dcd46d897adb70d63e025f175a00a89797d31a43 upstream.
Quoting[1] Ariadne Conill:
"In several other operating systems, it is a hard requirement that the
second argument to execve(2) be the name of a program, thus prohibiting
a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
but it is not an explicit requirement[2]:
The argument arg0 should point to a filename string that is
associated with the process being started by one of the exec
functions.
...
Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
but there was no consensus to support fixing this issue then.
Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
of this bug in a shellcode, we can reconsider.
This issue is being tracked in the KSPP issue tracker[5]."
While the initial code searches[6][7] turned up what appeared to be
mostly corner case tests, trying to that just reject argv == NULL
(or an immediately terminated pointer list) quickly started tripping[8]
existing userspace programs.
The next best approach is forcing a single empty string into argv and
adjusting argc to match. The number of programs depending on argc == 0
seems a smaller set than those calling execve with a NULL argv.
Account for the additional stack space in bprm_stack_limits(). Inject an
empty string when argc == 0 (and set argc = 1). Warn about the case so
userspace has some notice about the change:
process './argc0' launched './argc0' with NULL argv: empty string added
Additionally WARN() and reject NULL argv usage for kernel threads.
[1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
[3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
[4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
[5] https://github.com/KSPP/linux/issues/176
[6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
[7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
[8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/
Reported-by: Ariadne Conill <ariadne@dereferenced.org>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Ariadne Conill <ariadne@dereferenced.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220201000947.2453721-1-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8126b1c73108bc691f5643df19071a59a69d0bc6 upstream.
pstore_dump() is *always* invoked in atomic context (nowadays in an RCU
read-side critical section, before that under a spinlock).
It doesn't make sense to try to use semaphores here.
This is mostly a revert of commit ea84b580b955 ("pstore: Convert buf_lock
to semaphore"), except that two parts aren't restored back exactly as they
were:
- keep the lock initialization in pstore_register
- in efi_pstore_write(), always set the "block" flag to false
- omit "is_locked", that was unnecessary since
commit 959217c84c27 ("pstore: Actually give up during locking failure")
- fix the bailout message
The actual problem that the buggy commit was trying to address may have
been that the use of preemptible() in efi_pstore_write() was wrong - it
only looks at preempt_count() and the state of IRQs, but __rcu_read_lock()
doesn't touch either of those under CONFIG_PREEMPT_RCU.
(Sidenote: CONFIG_PREEMPT_RCU means that the scheduler can preempt tasks in
RCU read-side critical sections, but you're not allowed to actively
block/reschedule.)
Lockdep probably never caught the problem because it's very rare that you
actually hit the contended case, so lockdep always just sees the
down_trylock(), not the down_interruptible(), and so it can't tell that
there's a problem.
Fixes: ea84b580b955 ("pstore: Convert buf_lock to semaphore")
Cc: stable@vger.kernel.org
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220314185953.2068993-1-jannh@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7aab5c84a0f6ec2290e2ba4a6b245178b1bf949a upstream.
We inject IO error when rmdir non empty direcory, then got issue as follows:
step1: mkfs.ext4 -F /dev/sda
step2: mount /dev/sda test
step3: cd test
step4: mkdir -p 1/2
step5: rmdir 1
[ 110.920551] ext4_empty_dir: inject fault
[ 110.921926] EXT4-fs warning (device sda): ext4_rmdir:3113: inode #12:
comm rmdir: empty directory '1' has too many links (3)
step6: cd ..
step7: umount test
step8: fsck.ext4 -f /dev/sda
e2fsck 1.42.9 (28-Dec-2013)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry '..' in .../??? (13) has deleted/unused inode 12. Clear<y>? yes
Pass 3: Checking directory connectivity
Unconnected directory inode 13 (...)
Connect to /lost+found<y>? yes
Pass 4: Checking reference counts
Inode 13 ref count is 3, should be 2. Fix<y>? yes
Pass 5: Checking group summary information
/dev/sda: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda: 12/131072 files (0.0% non-contiguous), 26157/524288 blocks
ext4_rmdir
if (!ext4_empty_dir(inode))
goto end_rmdir;
ext4_empty_dir
bh = ext4_read_dirblock(inode, 0, DIRENT_HTREE);
if (IS_ERR(bh))
return true;
Now if read directory block failed, 'ext4_empty_dir' will return true, assume
directory is empty. Obviously, it will lead to above issue.
To solve this issue, if read directory block failed 'ext4_empty_dir' just
return false. To avoid making things worse when file system is already
corrupted, 'ext4_empty_dir' also return false.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20220228024815.3952506-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 84158b7f6a0624b81800b4e7c90f7fb7fdecf66c upstream.
When I rewrote the VMA dumping logic for coredumps, I changed it to
recognize ELF library mappings based on the file being executable instead
of the mapping having an ELF header. But turns out, distros ship many ELF
libraries as non-executable, so the heuristic goes wrong...
Restore the old behavior where FILTER(ELF_HEADERS) dumps the first page of
any offset-0 readable mapping that starts with the ELF magic.
This fix is technically layer-breaking a bit, because it checks for
something ELF-specific in fs/coredump.c; but since we probably want to
share this between standard ELF and FDPIC ELF anyway, I guess it's fine?
And this also keeps the change small for backporting.
Cc: stable@vger.kernel.org
Fixes: 429a22e776a2 ("coredump: rework elf/elf_fdpic vma_dump_size() into common helper")
Reported-by: Bill Messmer <wmessmer@microsoft.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220126025739.2014888-1-jannh@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4c7c44ee1650677fbe89d86edbad9497b7679b5c upstream.
When we mount a jffs2 image, assume that the first few blocks of
the image are normal and contain at least one xattr-related inode,
but the next block is abnormal. As a result, an error is returned
in jffs2_scan_eraseblock(). jffs2_clear_xattr_subsystem() is then
called in jffs2_build_filesystem() and then again in
jffs2_do_fill_super().
Finally we can observe the following report:
==================================================================
BUG: KASAN: use-after-free in jffs2_clear_xattr_subsystem+0x95/0x6ac
Read of size 8 at addr ffff8881243384e0 by task mount/719
Call Trace:
dump_stack+0x115/0x16b
jffs2_clear_xattr_subsystem+0x95/0x6ac
jffs2_do_fill_super+0x84f/0xc30
jffs2_fill_super+0x2ea/0x4c0
mtd_get_sb+0x254/0x400
mtd_get_sb_by_nr+0x4f/0xd0
get_tree_mtd+0x498/0x840
jffs2_get_tree+0x25/0x30
vfs_get_tree+0x8d/0x2e0
path_mount+0x50f/0x1e50
do_mount+0x107/0x130
__se_sys_mount+0x1c5/0x2f0
__x64_sys_mount+0xc7/0x160
do_syscall_64+0x45/0x70
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Allocated by task 719:
kasan_save_stack+0x23/0x60
__kasan_kmalloc.constprop.0+0x10b/0x120
kasan_slab_alloc+0x12/0x20
kmem_cache_alloc+0x1c0/0x870
jffs2_alloc_xattr_ref+0x2f/0xa0
jffs2_scan_medium.cold+0x3713/0x4794
jffs2_do_mount_fs.cold+0xa7/0x2253
jffs2_do_fill_super+0x383/0xc30
jffs2_fill_super+0x2ea/0x4c0
[...]
Freed by task 719:
kmem_cache_free+0xcc/0x7b0
jffs2_free_xattr_ref+0x78/0x98
jffs2_clear_xattr_subsystem+0xa1/0x6ac
jffs2_do_mount_fs.cold+0x5e6/0x2253
jffs2_do_fill_super+0x383/0xc30
jffs2_fill_super+0x2ea/0x4c0
[...]
The buggy address belongs to the object at ffff8881243384b8
which belongs to the cache jffs2_xattr_ref of size 48
The buggy address is located 40 bytes inside of
48-byte region [ffff8881243384b8, ffff8881243384e8)
[...]
==================================================================
The triggering of the BUG is shown in the following stack:
-----------------------------------------------------------
jffs2_fill_super
jffs2_do_fill_super
jffs2_do_mount_fs
jffs2_build_filesystem
jffs2_scan_medium
jffs2_scan_eraseblock <--- ERROR
jffs2_clear_xattr_subsystem <--- free
jffs2_clear_xattr_subsystem <--- free again
-----------------------------------------------------------
An error is returned in jffs2_do_mount_fs(). If the error is returned
by jffs2_sum_init(), the jffs2_clear_xattr_subsystem() does not need to
be executed. If the error is returned by jffs2_build_filesystem(), the
jffs2_clear_xattr_subsystem() also does not need to be executed again.
So move jffs2_clear_xattr_subsystem() from 'out_inohash' to 'out_root'
to fix this UAF problem.
Fixes: aa98d7cf59b5 ("[JFFS2][XATTR] XATTR support on JFFS2 (version. 5)")
Cc: stable@vger.kernel.org
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5b5b4f85b01604389f7a0f11ef180a725bf0e2d4 upstream.
As bughunter reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=215709
f2fs may hang when mounting a fuzzed image, the dmesg shows as below:
__filemap_get_folio+0x3a9/0x590
pagecache_get_page+0x18/0x60
__get_meta_page+0x95/0x460 [f2fs]
get_checkpoint_version+0x2a/0x1e0 [f2fs]
validate_checkpoint+0x8e/0x2a0 [f2fs]
f2fs_get_valid_checkpoint+0xd0/0x620 [f2fs]
f2fs_fill_super+0xc01/0x1d40 [f2fs]
mount_bdev+0x18a/0x1c0
f2fs_mount+0x15/0x20 [f2fs]
legacy_get_tree+0x28/0x50
vfs_get_tree+0x27/0xc0
path_mount+0x480/0xaa0
do_mount+0x7c/0xa0
__x64_sys_mount+0x8b/0xe0
do_syscall_64+0x38/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
The root cause is cp_pack_total_block_count field in checkpoint was fuzzed
to one, as calcuated, two cp pack block locates in the same block address,
so then read latter cp pack block, it will block on the page lock due to
the lock has already held when reading previous cp pack block, fix it by
adding sanity check for cp_pack_total_block_count.
Cc: stable@vger.kernel.org
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6d18762ed5cd549fde74fd0e05d4d87bac5a3beb upstream.
As Pavel Machek reported in below link [1]:
After commit 77900c45ee5c ("f2fs: fix to do sanity check in is_alive()"),
node page should be unlock via calling f2fs_put_page() in the error path
of is_alive(), otherwise, f2fs may hang when it tries to lock the node
page, fix it.
[1] https://lore.kernel.org/stable/20220124203637.GA19321@duo.ucw.cz/
Fixes: 77900c45ee5c ("f2fs: fix to do sanity check in is_alive()")
Cc: <stable@vger.kernel.org>
Reported-by: Pavel Machek <pavel@denx.de>
Signed-off-by: Pavel Machek <pavel@denx.de>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 184416d4b98509fb4c3d8fc3d6dc1437896cc159 upstream.
Smatch complains:
fs/nfsd/nfsxdr.c:341 nfssvc_decode_writeargs()
warn: no lower bound on 'args->len'
Change the type to unsigned to prevent this issue.
Cc: stable@vger.kernel.org
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 839a534f1e853f1aec100d06040c0037b89c2dc3 upstream.
In d_make_root, when we fail to allocate dentry for root inode,
we will iput root inode and returned value is NULL in this function.
So we do not need to release this inode again at d_make_root's caller.
Signed-off-by: Chen Li <chenli@uniontech.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Tadeusz Struk <tadeusz.struk@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7b0b1332cfdb94489836b67d088a779699f8e47e upstream.
Once s_root is set, genric_shutdown_super() will be called if
fill_super() fails. That means, we will call ocfs2_dismount_volume()
twice in such case, which can lead to kernel crash.
Fix this issue by initializing filecheck kobj before setting s_root.
Link: https://lkml.kernel.org/r/20220310081930.86305-1-joseph.qi@linux.alibaba.com
Fixes: 5f483c4abb50 ("ocfs2: add kobject for online file check")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f70865db5ff35f5ed0c7e9ef63e7cca3d4947f04 upstream.
Revert of revert of "io_uring: wait potential ->release() on resurrect",
which adds a helper for resurrect not racing completion reinit, as was
removed because of a strange bug with no clear root or link to the
patch.
Was improved, instead of rcu_synchronize(), just wait_for_completion()
because we're at 0 refs and it will happen very shortly. Specifically
use non-interruptible version to ignore all pending signals that may
have ended prior interruptible wait.
This reverts commit cb5e1b81304e089ee3ca948db4d29f71902eb575.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7a080c20f686d026efade810b116b72f88abaff9.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b1489186cc8391e0c1e342f9fbc3eedf6b944c61 upstream.
The in-kernel ext4 resize code doesn't support filesystem with the
sparse_super2 feature. It fails with errors like this and doesn't finish
the resize:
EXT4-fs (loop0): resizing filesystem from 16640 to 7864320 blocks
EXT4-fs warning (device loop0): verify_reserved_gdb:760: reserved GDT 2 missing grp 1 (32770)
EXT4-fs warning (device loop0): ext4_resize_fs:2111: error (-22) occurred during file system resize
EXT4-fs (loop0): resized filesystem to 2097152
To reproduce:
mkfs.ext4 -b 4096 -I 256 -J size=32 -E resize=$((256*1024*1024)) -O sparse_super2 ext4.img 65M
truncate -s 30G ext4.img
mount ext4.img /mnt
python3 -c 'import fcntl, os, struct ; fd = os.open("/mnt", os.O_RDONLY | os.O_DIRECTORY) ; fcntl.ioctl(fd, 0x40086610, struct.pack("Q", 30 * 1024 * 1024 * 1024 // 4096), False) ; os.close(fd)'
dmesg | tail
e2fsck ext4.img
The userspace resize2fs tool has a check for this case: it checks if the
filesystem has sparse_super2 set and if the kernel provides
/sys/fs/ext4/features/sparse_super2. However, the former check requires
manually reading and parsing the filesystem superblock.
Detect this case in ext4_resize_begin and error out early with a clear
error message.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Link: https://lore.kernel.org/r/74b8ae78405270211943cd7393e65586c5faeed1.1623093259.git.josh@joshtriplett.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2ed147f015af2b48f41c6f0b6746aa9ea85c19f3 upstream.
There's nothing to synchronise post_one_notification() versus
pipe_read(). Whilst posting is done under pipe->rd_wait.lock, the
reader only takes pipe->mutex which cannot bar notification posting as
that may need to be made from contexts that cannot sleep.
Fix this by setting pipe->head with a barrier in post_one_notification()
and reading pipe->head with a barrier in pipe_read().
If that's not sufficient, the rd_wait.lock will need to be taken,
possibly in a ->confirm() op so that it only applies to notifications.
The lock would, however, have to be dropped before copy_page_to_iter()
is invoked.
Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit db8facfc9fafacefe8a835416a6b77c838088f8b upstream.
In free_pipe_info(), free the watchqueue state after clearing the pipe
ring as each pipe ring descriptor has a release function, and in the
case of a notification message, this is watch_queue_pipe_buf_release()
which tries to mark the allocation bitmap that was previously released.
Fix this by moving the put of the pipe's ref on the watch queue to after
the ring has been cleared. We still need to call watch_queue_clear()
before doing that to make sure that the pipe is disconnected from any
notification sources first.
Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0c4bcfdecb1ac0967619ee7ff44871d93c08c909 upstream.
In FOPEN_DIRECT_IO mode, fuse_file_write_iter() calls
fuse_direct_write_iter(), which normally calls fuse_direct_io(), which then
imports the write buffer with fuse_get_user_pages(), which uses
iov_iter_get_pages() to grab references to userspace pages instead of
actually copying memory.
On the filesystem device side, these pages can then either be read to
userspace (via fuse_dev_read()), or splice()d over into a pipe using
fuse_dev_splice_read() as pipe buffers with &nosteal_pipe_buf_ops.
This is wrong because after fuse_dev_do_read() unlocks the FUSE request,
the userspace filesystem can mark the request as completed, causing write()
to return. At that point, the userspace filesystem should no longer have
access to the pipe buffer.
Fix by copying pages coming from the user address space to new pipe
buffers.
Reported-by: Jann Horn <jannh@google.com>
Fixes: c3021629a0d8 ("fuse: support splice() reading from fuse device")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4751dc99627e4d1465c5bfa8cb7ab31ed418eff5 upstream.
During log replay, whenever we need to check if a name (dentry) exists in
a directory we do searches on the subvolume tree for inode references or
or directory entries (BTRFS_DIR_INDEX_KEY keys, and BTRFS_DIR_ITEM_KEY
keys as well, before kernel 5.17). However when during log replay we
unlink a name, through btrfs_unlink_inode(), we may not delete inode
references and dir index keys from a subvolume tree and instead just add
the deletions to the delayed inode's delayed items, which will only be
run when we commit the transaction used for log replay. This means that
after an unlink operation during log replay, if we attempt to search for
the same name during log replay, we will not see that the name was already
deleted, since the deletion is recorded only on the delayed items.
We run delayed items after every unlink operation during log replay,
except at unlink_old_inode_refs() and at add_inode_ref(). This was due
to an overlook, as delayed items should be run after evert unlink, for
the reasons stated above.
So fix those two cases.
Fixes: 0d836392cadd5 ("Btrfs: fix mount failure after fsync due to hard link recreation")
Fixes: 1f250e929a9c9 ("Btrfs: fix log replay failure after unlink and link combination")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d4aef1e122d8bbdc15ce3bd0bc813d6b44a7d63a upstream.
The commit e804861bd4e6 ("btrfs: fix deadlock between quota disable and
qgroup rescan worker") by Kawasaki resolves deadlock between quota
disable and qgroup rescan worker. But also there is a deadlock case like
it. It's about enabling or disabling quota and creating or removing
qgroup. It can be reproduced in simple script below.
for i in {1..100}
do
btrfs quota enable /mnt &
btrfs qgroup create 1/0 /mnt &
btrfs qgroup destroy 1/0 /mnt &
btrfs quota disable /mnt &
done
Here's why the deadlock happens:
1) The quota rescan task is running.
2) Task A calls btrfs_quota_disable(), locks the qgroup_ioctl_lock
mutex, and then calls btrfs_qgroup_wait_for_completion(), to wait for
the quota rescan task to complete.
3) Task B calls btrfs_remove_qgroup() and it blocks when trying to lock
the qgroup_ioctl_lock mutex, because it's being held by task A. At that
point task B is holding a transaction handle for the current transaction.
4) The quota rescan task calls btrfs_commit_transaction(). This results
in it waiting for all other tasks to release their handles on the
transaction, but task B is blocked on the qgroup_ioctl_lock mutex
while holding a handle on the transaction, and that mutex is being held
by task A, which is waiting for the quota rescan task to complete,
resulting in a deadlock between these 3 tasks.
To resolve this issue, the thread disabling quota should unlock
qgroup_ioctl_lock before waiting rescan completion. Move
btrfs_qgroup_wait_for_completion() after unlock of qgroup_ioctl_lock.
Fixes: e804861bd4e6 ("btrfs: fix deadlock between quota disable and qgroup rescan worker")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d99478874355d3a7b9d86dfb5d7590d5b1754b1f upstream.
When doing a full fsync, if we have prealloc extents beyond (or at) eof,
and the leaves that contain them were not modified in the current
transaction, we end up not logging them. This results in losing those
extents when we replay the log after a power failure, since the inode is
truncated to the current value of the logged i_size.
Just like for the fast fsync path, we need to always log all prealloc
extents starting at or beyond i_size. The fast fsync case was fixed in
commit 471d557afed155 ("Btrfs: fix loss of prealloc extents past i_size
after fsync log replay") but it missed the full fsync path. The problem
exists since the very early days, when the log tree was added by
commit e02119d5a7b439 ("Btrfs: Add a write ahead tree log to optimize
synchronous operations").
Example reproducer:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
# Create our test file with many file extent items, so that they span
# several leaves of metadata, even if the node/page size is 64K. Use
# direct IO and not fsync/O_SYNC because it's both faster and it avoids
# clearing the full sync flag from the inode - we want the fsync below
# to trigger the slow full sync code path.
$ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo
# Now add two preallocated extents to our file without extending the
# file's size. One right at i_size, and another further beyond, leaving
# a gap between the two prealloc extents.
$ xfs_io -c "falloc -k 16M 1M" /mnt/foo
$ xfs_io -c "falloc -k 20M 1M" /mnt/foo
# Make sure everything is durably persisted and the transaction is
# committed. This makes all created extents to have a generation lower
# than the generation of the transaction used by the next write and
# fsync.
sync
# Now overwrite only the first extent, which will result in modifying
# only the first leaf of metadata for our inode. Then fsync it. This
# fsync will use the slow code path (inode full sync bit is set) because
# it's the first fsync since the inode was created/loaded.
$ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo
# Extent list before power failure.
$ xfs_io -c "fiemap -v" /mnt/foo
/mnt/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 2178048..2178055 8 0x0
1: [8..16383]: 26632..43007 16376 0x0
2: [16384..32767]: 2156544..2172927 16384 0x0
3: [32768..34815]: 2172928..2174975 2048 0x800
4: [34816..40959]: hole 6144
5: [40960..43007]: 2174976..2177023 2048 0x801
<power fail>
# Mount fs again, trigger log replay.
$ mount /dev/sdc /mnt
# Extent list after power failure and log replay.
$ xfs_io -c "fiemap -v" /mnt/foo
/mnt/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 2178048..2178055 8 0x0
1: [8..16383]: 26632..43007 16376 0x0
2: [16384..32767]: 2156544..2172927 16384 0x1
# The prealloc extents at file offsets 16M and 20M are missing.
So fix this by calling btrfs_log_prealloc_extents() when we are doing a
full fsync, so that we always log all prealloc extents beyond eof.
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 92fba084b79e6bc7b12fc118209f1922c1a2df56 ]
In exfat_truncate(), the computation of inode->i_blocks is wrong if
the file is larger than 4 GiB because a 32-bit variable is used as a
mask. This is fixed and simplified by using round_up().
Also fix the same buggy computation in exfat_read_root() and another
(correct) one in exfat_fill_inode(). The latter was fixed another way
last month but can be simplified by using round_up() as well. See:
commit 0c336d6e33f4 ("exfat: fix incorrect loading of i_blocks for
large files")
Fixes: 98d917047e8b ("exfat: add file operations")
Cc: stable@vger.kernel.org # v5.7+
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Christophe Vu-Brugier <christophe.vu-brugier@seagate.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3d6cc9898efdfb062efb74dc18cfc700e082f5d5 ]
When cifs_get_root() fails during cifs_smb3_do_mount() we call
deactivate_locked_super() which eventually will call delayed_free() which
will free the context.
In this situation we should not proceed to enter the out: section in
cifs_smb3_do_mount() and free the same resources a second time.
[Thu Feb 10 12:59:06 2022] BUG: KASAN: use-after-free in rcu_cblist_dequeue+0x32/0x60
[Thu Feb 10 12:59:06 2022] Read of size 8 at addr ffff888364f4d110 by task swapper/1/0
[Thu Feb 10 12:59:06 2022] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G OE 5.17.0-rc3+ #4
[Thu Feb 10 12:59:06 2022] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
[Thu Feb 10 12:59:06 2022] Call Trace:
[Thu Feb 10 12:59:06 2022] <IRQ>
[Thu Feb 10 12:59:06 2022] dump_stack_lvl+0x5d/0x78
[Thu Feb 10 12:59:06 2022] print_address_description.constprop.0+0x24/0x150
[Thu Feb 10 12:59:06 2022] ? rcu_cblist_dequeue+0x32/0x60
[Thu Feb 10 12:59:06 2022] kasan_report.cold+0x7d/0x117
[Thu Feb 10 12:59:06 2022] ? rcu_cblist_dequeue+0x32/0x60
[Thu Feb 10 12:59:06 2022] __asan_load8+0x86/0xa0
[Thu Feb 10 12:59:06 2022] rcu_cblist_dequeue+0x32/0x60
[Thu Feb 10 12:59:06 2022] rcu_core+0x547/0xca0
[Thu Feb 10 12:59:06 2022] ? call_rcu+0x3c0/0x3c0
[Thu Feb 10 12:59:06 2022] ? __this_cpu_preempt_check+0x13/0x20
[Thu Feb 10 12:59:06 2022] ? lock_is_held_type+0xea/0x140
[Thu Feb 10 12:59:06 2022] rcu_core_si+0xe/0x10
[Thu Feb 10 12:59:06 2022] __do_softirq+0x1d4/0x67b
[Thu Feb 10 12:59:06 2022] __irq_exit_rcu+0x100/0x150
[Thu Feb 10 12:59:06 2022] irq_exit_rcu+0xe/0x30
[Thu Feb 10 12:59:06 2022] sysvec_hyperv_stimer0+0x9d/0xc0
...
[Thu Feb 10 12:59:07 2022] Freed by task 58179:
[Thu Feb 10 12:59:07 2022] kasan_save_stack+0x26/0x50
[Thu Feb 10 12:59:07 2022] kasan_set_track+0x25/0x30
[Thu Feb 10 12:59:07 2022] kasan_set_free_info+0x24/0x40
[Thu Feb 10 12:59:07 2022] ____kasan_slab_free+0x137/0x170
[Thu Feb 10 12:59:07 2022] __kasan_slab_free+0x12/0x20
[Thu Feb 10 12:59:07 2022] slab_free_freelist_hook+0xb3/0x1d0
[Thu Feb 10 12:59:07 2022] kfree+0xcd/0x520
[Thu Feb 10 12:59:07 2022] cifs_smb3_do_mount+0x149/0xbe0 [cifs]
[Thu Feb 10 12:59:07 2022] smb3_get_tree+0x1a0/0x2e0 [cifs]
[Thu Feb 10 12:59:07 2022] vfs_get_tree+0x52/0x140
[Thu Feb 10 12:59:07 2022] path_mount+0x635/0x10c0
[Thu Feb 10 12:59:07 2022] __x64_sys_mount+0x1bf/0x210
[Thu Feb 10 12:59:07 2022] do_syscall_64+0x5c/0xc0
[Thu Feb 10 12:59:07 2022] entry_SYSCALL_64_after_hwframe+0x44/0xae
[Thu Feb 10 12:59:07 2022] Last potentially related work creation:
[Thu Feb 10 12:59:07 2022] kasan_save_stack+0x26/0x50
[Thu Feb 10 12:59:07 2022] __kasan_record_aux_stack+0xb6/0xc0
[Thu Feb 10 12:59:07 2022] kasan_record_aux_stack_noalloc+0xb/0x10
[Thu Feb 10 12:59:07 2022] call_rcu+0x76/0x3c0
[Thu Feb 10 12:59:07 2022] cifs_umount+0xce/0xe0 [cifs]
[Thu Feb 10 12:59:07 2022] cifs_kill_sb+0xc8/0xe0 [cifs]
[Thu Feb 10 12:59:07 2022] deactivate_locked_super+0x5d/0xd0
[Thu Feb 10 12:59:07 2022] cifs_smb3_do_mount+0xab9/0xbe0 [cifs]
[Thu Feb 10 12:59:07 2022] smb3_get_tree+0x1a0/0x2e0 [cifs]
[Thu Feb 10 12:59:07 2022] vfs_get_tree+0x52/0x140
[Thu Feb 10 12:59:07 2022] path_mount+0x635/0x10c0
[Thu Feb 10 12:59:07 2022] __x64_sys_mount+0x1bf/0x210
[Thu Feb 10 12:59:07 2022] do_syscall_64+0x5c/0xc0
[Thu Feb 10 12:59:07 2022] entry_SYSCALL_64_after_hwframe+0x44/0xae
Reported-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 851e99ebeec3f4a672bb5010cf1ece095acee447 upstream.
Al Viro brought it to my attention that the dentries may not be filled
when the parse_options() is called, causing the call to set_gid() to
possibly crash. It should only be called if parse_options() succeeds
totally anyway.
He suggested the logical place to do the update is in apply_options().
Link: https://lore.kernel.org/all/20220225165219.737025658@goodmis.org/
Link: https://lkml.kernel.org/r/20220225153426.1c4cab6b@gandalf.local.home
Cc: stable@vger.kernel.org
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 48b27b6b5191 ("tracefs: Set all files to the same group ownership as the mount option")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 84ec758fb2daa236026506868c8796b0500c047d ]
When configfs_register_subsystem() or configfs_unregister_subsystem()
is executing link_group() or unlink_group(),
it is possible that two processes add or delete list concurrently.
Some unfortunate interleavings of them can cause kernel panic.
One of cases is:
A --> B --> C --> D
A <-- B <-- C <-- D
delete list_head *B | delete list_head *C
--------------------------------|-----------------------------------
configfs_unregister_subsystem | configfs_unregister_subsystem
unlink_group | unlink_group
unlink_obj | unlink_obj
list_del_init | list_del_init
__list_del_entry | __list_del_entry
__list_del | __list_del
// next == C |
next->prev = prev |
| next->prev = prev
prev->next = next |
| // prev == B
| prev->next = next
Fix this by adding mutex when calling link_group() or unlink_group(),
but parent configfs_subsystem is NULL when config_item is root.
So I create a mutex configfs_subsystem_mutex.
Fixes: 7063fbf22611 ("[PATCH] configfs: User-driven configuration filesystem")
Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit ea1d1ca4025ac6c075709f549f9aa036b5b6597d upstream.
Check item size before accessing the device item to avoid out of bound
access, similar to inode_item check.
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d19e0183a88306acda07f4a01fedeeffe2a2a06b upstream.
The result of the writeback, whether it is an ENOSPC or an EIO, or
anything else, does not inhibit the NFS client from reporting the
correct file timestamps.
Fixes: 79566ef018f5 ("NFS: Getattr doesn't require data sync semantics")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e0caaf75d443e02e55e146fd75fe2efc8aed5540 upstream.
Commit ac795161c936 (NFSv4: Handle case where the lookup of a directory
fails) [1], part of Linux since 5.17-rc2, introduced a regression, where
a symbolic link on an NFS mount to a directory on another NFS does not
resolve(?) the first time it is accessed:
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Fixes: ac795161c936 ("NFSv4: Handle case where the lookup of a directory fails")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit dd5532a4994bfda0386eb2286ec00758cee08444 ]
Strangely, dquot_quota_sync ignores the return code from the ->sync_fs
call, which means that quotacalls like Q_SYNC never see the error. This
doesn't seem right, so fix that.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2719c7160dcfaae1f73a1c0c210ad3281c19022e ]
If we fail to synchronize the filesystem while preparing to freeze the
fs, abort the freeze.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 2e7be9db125a0bf940c5d65eb5c40d8700f738b5 upstream.
Currently if we get IO error while doing send then we abort without
logging information about which file caused issue. So log it to help
with debugging.
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Dāvis Mosāns <davispuh@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 24d7275ce2791829953ed4e72f68277ceb2571c6 upstream.
The syzbot reported the below BUG:
kernel BUG at include/linux/page-flags.h:785!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 4392 Comm: syz-executor560 Not tainted 5.16.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:PageDoubleMap include/linux/page-flags.h:785 [inline]
RIP: 0010:__page_mapcount+0x2d2/0x350 mm/util.c:744
Call Trace:
page_mapcount include/linux/mm.h:837 [inline]
smaps_account+0x470/0xb10 fs/proc/task_mmu.c:466
smaps_pte_entry fs/proc/task_mmu.c:538 [inline]
smaps_pte_range+0x611/0x1250 fs/proc/task_mmu.c:601
walk_pmd_range mm/pagewalk.c:128 [inline]
walk_pud_range mm/pagewalk.c:205 [inline]
walk_p4d_range mm/pagewalk.c:240 [inline]
walk_pgd_range mm/pagewalk.c:277 [inline]
__walk_page_range+0xe23/0x1ea0 mm/pagewalk.c:379
walk_page_vma+0x277/0x350 mm/pagewalk.c:530
smap_gather_stats.part.0+0x148/0x260 fs/proc/task_mmu.c:768
smap_gather_stats fs/proc/task_mmu.c:741 [inline]
show_smap+0xc6/0x440 fs/proc/task_mmu.c:822
seq_read_iter+0xbb0/0x1240 fs/seq_file.c:272
seq_read+0x3e0/0x5b0 fs/seq_file.c:162
vfs_read+0x1b5/0x600 fs/read_write.c:479
ksys_read+0x12d/0x250 fs/read_write.c:619
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The reproducer was trying to read /proc/$PID/smaps when calling
MADV_FREE at the mean time. MADV_FREE may split THPs if it is called
for partial THP. It may trigger the below race:
CPU A CPU B
----- -----
smaps walk: MADV_FREE:
page_mapcount()
PageCompound()
split_huge_page()
page = compound_head(page)
PageDoubleMap(page)
When calling PageDoubleMap() this page is not a tail page of THP anymore
so the BUG is triggered.
This could be fixed by elevated refcount of the page before calling
mapcount, but that would prevent it from counting migration entries, and
it seems overkilling because the race just could happen when PMD is
split so all PTE entries of tail pages are actually migration entries,
and smaps_account() does treat migration entries as mapcount == 1 as
Kirill pointed out.
Add a new parameter for smaps_account() to tell this entry is migration
entry then skip calling page_mapcount(). Don't skip getting mapcount
for device private entries since they do track references with mapcount.
Pagemap also has the similar issue although it was not reported. Fixed
it as well.
[shy828301@gmail.com: v4]
Link: https://lkml.kernel.org/r/20220203182641.824731-1-shy828301@gmail.com
[nathan@kernel.org: avoid unused variable warning in pagemap_pmd_range()]
Link: https://lkml.kernel.org/r/20220207171049.1102239-1-nathan@kernel.org
Link: https://lkml.kernel.org/r/20220120202805.3369-1-shy828301@gmail.com
Fixes: e9b61f19858a ("thp: reintroduce split_huge_page()")
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: syzbot+1f52b3a18d5633fa7f82@syzkaller.appspotmail.com
Acked-by: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e386dfc56f837da66d00a078e5314bc8382fab83 upstream.
Commit 054aa8d439b9 ("fget: check that the fd still exists after getting
a ref to it") fixed a race with getting a reference to a file just as it
was being closed. It was a fairly minimal patch, and I didn't think
re-checking the file pointer lookup would be a measurable overhead,
since it was all right there and cached.
But I was wrong, as pointed out by the kernel test robot.
The 'poll2' case of the will-it-scale.per_thread_ops benchmark regressed
quite noticeably. Admittedly it seems to be a very artificial test:
doing "poll()" system calls on regular files in a very tight loop in
multiple threads.
That means that basically all the time is spent just looking up file
descriptors without ever doing anything useful with them (not that doing
'poll()' on a regular file is useful to begin with). And as a result it
shows the extra "re-check fd" cost as a sore thumb.
Happily, the regression is fixable by just writing the code to loook up
the fd to be better and clearer. There's still a cost to verify the
file pointer, but now it's basically in the noise even for that
benchmark that does nothing else - and the code is more understandable
and has better comments too.
[ Side note: this patch is also a classic case of one that looks very
messy with the default greedy Myers diff - it's much more legible with
either the patience of histogram diff algorithm ]
Link: https://lore.kernel.org/lkml/20211210053743.GA36420@xsang-OptiPlex-9020/
Link: https://lore.kernel.org/lkml/20211213083154.GA20853@linux.intel.com/
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Carel Si <beibei.si@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a8d54baba7c65db2d3278873def61f8d3753d766 ]
An fs_location attribute returns a string that can be ipv4, ipv6,
or DNS name. An ip location can have a port appended to it and if
no port is present a default port needs to be set. If rpc_pton()
fails to parse, try calling rpc_uaddr2socaddr() that can convert
an universal address.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f5b27cc6761e27ee6387a24df1a99ca77b360fea ]
Make nfs_parse_server_name available outside of nfs4namespace.c.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 90e12a3191040bd3854d3e236c35921e4e92a044 ]
Remove the check for the zero length fs_locations reply in the
xdr decoding, and instead check for that in the migration code.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>