77620 Commits

Author SHA1 Message Date
Daeho Jeong
f8e2f32bcd f2fs: introduce sysfs atomic write statistics
introduce the below 4 new sysfs node for atomic write statistics.
- current_atomic_write: the total current atomic write block count,
                        which is not committed yet.
- peak_atomic_write: the peak value of total current atomic write block
                     count after boot.
- committed_atomic_block: the accumulated total committed atomic write
                          block count after boot.
- revoked_atomic_block: the accumulated total revoked atomic write block
                        count after boot.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:07 -07:00
qixiaoyu1
1adaa71ea9 f2fs: don't bother wait_ms by foreground gc
f2fs_gc returns -EINVAL via f2fs_balance_fs when there is enough free
secs after write checkpoint, but with gc_merge enabled, it will cause
the sleep time of gc thread to be set to no_gc_sleep_time even if there
are many dirty segments can be selected.

Signed-off-by: qixiaoyu1 <qixiaoyu1@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:07 -07:00
Chao Yu
0d5b9d8156 f2fs: invalidate meta pages only for post_read required inode
After commit e3b49ea36802 ("f2fs: invalidate META_MAPPING before
IPU/DIO write"), invalidate_mapping_pages() will be called to
avoid race condition in between IPU/DIO and readahead for GC.

However, readahead flow is only used for post_read required inode,
so this patch adds check condition to avoids unnecessary page cache
invalidating for non-post_read inode.

Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:06 -07:00
Chao Liu
a8634ccf5d f2fs: allow compression of files without blocks
Files created by truncate(1) have a size but no blocks, so
they can be allowed to enable compression.

Signed-off-by: Chao Liu <liuchao@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:06 -07:00
Chao Yu
7165841d57 f2fs: fix to check inline_data during compressed inode conversion
When converting inode to compressed one via ioctl, it needs to check
inline_data, since inline_data flag and compressed flag are incompatible.

Fixes: 4c8ff7095bef ("f2fs: support data compression")
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:06 -07:00
Fabio M. De Francesco
1dd55358ef f2fs: Delete f2fs_copy_page() and replace with memcpy_page()
f2fs_copy_page() is a wrapper around two kmap() + one memcpy() from/to
the mapped pages. It unnecessarily duplicates a kernel API and it makes
use of kmap(), which is being deprecated in favor of kmap_local_page().

Two main problems with kmap(): (1) It comes with an overhead as mapping
space is restricted and protected by a global lock for synchronization and
(2) it also requires global TLB invalidation when the kmap’s pool wraps
and it might block when the mapping space is fully utilized until a slot
becomes available.

With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Therefore, its
use in __clone_blkaddrs() is safe and should be preferred.

Delete f2fs_copy_page() and use a plain memcpy_page() in the only one
site calling the removed function. memcpy_page() avoids open coding two
kmap_local_page() + one memcpy() between the two kernel virtual addresses.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:57 -07:00
Chao Yu
67ca06872e f2fs: fix to invalidate META_MAPPING before DIO write
Quoted from commit e3b49ea36802 ("f2fs: invalidate META_MAPPING before
IPU/DIO write")

"
Encrypted pages during GC are read and cached in META_MAPPING.
However, due to cached pages in META_MAPPING, there is an issue where
newly written pages are lost by IPU or DIO writes.

Thread A - f2fs_gc()            Thread B
/* phase 3 */
down_write(i_gc_rwsem)
ra_data_block()       ---- (a)
up_write(i_gc_rwsem)
                                f2fs_direct_IO() :
                                 - down_read(i_gc_rwsem)
                                 - __blockdev_direct_io()
                                 - get_data_block_dio_write()
                                 - f2fs_dio_submit_bio()  ---- (b)
                                 - up_read(i_gc_rwsem)
/* phase 4 */
down_write(i_gc_rwsem)
move_data_block()     ---- (c)
up_write(i_gc_rwsem)

(a) In phase 3 of f2fs_gc(), up-to-date page is read from storage and
    cached in META_MAPPING.
(b) In thread B, writing new data by IPU or DIO write on same blkaddr as
    read in (a). cached page in META_MAPPING become out-dated.
(c) In phase 4 of f2fs_gc(), out-dated page in META_MAPPING is copied to
    new blkaddr. In conclusion, the newly written data in (b) is lost.

To address this issue, invalidating pages in META_MAPPING before IPU or
DIO write.
"

In previous commit, we missed to cover extent cache hit case, and passed
wrong value for parameter @end of invalidate_mapping_pages(), fix both
issues.

Fixes: 6aa58d8ad20a ("f2fs: readahead encrypted block during GC")
Fixes: e3b49ea36802 ("f2fs: invalidate META_MAPPING before IPU/DIO write")
Cc: Hyeong-Jun Kim <hj514.kim@samsung.com>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:20 -07:00
Jaegeuk Kim
8e0f54a70e f2fs: add a sysfs entry to show zone capacity
This patch adds a sysfs entry showing the unusable space in a section
made by zone capacity.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:20 -07:00
Jaegeuk Kim
074b5ea290 f2fs: adjust zone capacity when considering valid block count
This patch fixes counting unusable blocks set by zone capacity when
checking the valid block count in a section.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:20 -07:00
Jaegeuk Kim
b771aadc6e f2fs: enforce single zone capacity
In order to simplify the complicated per-zone capacity, let's support
only one capacity for entire zoned device.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:20 -07:00
duguowei
14de5fc3dd f2fs: remove redundant code for gc condition
Remove the redundant code and use local variant as the
argument directly. Make it more human-readable.

Signed-off-by: duguowei <duguowei@xiaomi.com>
[Jaegeuk Kim: make code neat]
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:20 -07:00
Daeho Jeong
7a8fc58618 f2fs: introduce memory mode
Introduce memory mode to supports "normal" and "low" memory modes.
"low" mode is to support low memory devices. Because of the nature of
low memory devices, in this mode, f2fs will try to save memory sometimes
by sacrificing performance. "normal" mode is the default mode and same
as before.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:12 -07:00
Sebastian Andrzej Siewior
50417d22d0 fs/dcache: Move wakeup out of i_seq_dir write held region.
__d_add() and __d_move() wake up waiters on dentry::d_wait from within
the i_seq_dir write held region.  This violates the PREEMPT_RT
constraints as the wake up acquires wait_queue_head::lock which is a
"sleeping" spinlock on RT.

There is no requirement to do so. __d_lookup_unhash() has cleared
DCACHE_PAR_LOOKUP and dentry::d_wait and returned the now unreachable wait
queue head pointer to the caller, so the actual wake up can be postponed
until the i_dir_seq write side critical section is left. The only
requirement is that dentry::lock is held across the whole sequence
including the wake up. The previous commit includes an analysis why this
is considered safe.

Move the wake up past end_dir_add() which leaves the i_dir_seq write side
critical section and enables preemption.

For non RT kernels there is no difference because preemption is still
disabled due to dentry::lock being held, but it shortens the time between
wake up and unlocking dentry::lock, which reduces the contention for the
woken up waiter.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-30 00:38:16 -04:00
Sebastian Andrzej Siewior
45f78b0a27 fs/dcache: Move the wakeup from __d_lookup_done() to the caller.
__d_lookup_done() wakes waiters on dentry->d_wait.  On PREEMPT_RT we are
not allowed to do that with preemption disabled, since the wakeup
acquired wait_queue_head::lock, which is a "sleeping" spinlock on RT.

Calling it under dentry->d_lock is not a problem, since that is also a
"sleeping" spinlock on the same configs.  Unfortunately, two of its
callers (__d_add() and __d_move()) are holding more than just ->d_lock
and that needs to be dealt with.

The key observation is that wakeup can be moved to any point before
dropping ->d_lock.

As a first step to solve this, move the wake up outside of the
hlist_bl_lock() held section.

This is safe because:

Waiters get inserted into ->d_wait only after they'd taken ->d_lock
and observed DCACHE_PAR_LOOKUP in flags.  As long as they are
woken up (and evicted from the queue) between the moment __d_lookup_done()
has removed DCACHE_PAR_LOOKUP and dropping ->d_lock, we are safe,
since the waitqueue ->d_wait points to won't get destroyed without
having __d_lookup_done(dentry) called (under ->d_lock).

->d_wait is set only by d_alloc_parallel() and only in case when
it returns a freshly allocated in-lookup dentry.  Whenever that happens,
we are guaranteed that __d_lookup_done() will be called for resulting
dentry (under ->d_lock) before the wq in question gets destroyed.

With two exceptions wq lives in call frame of the caller of
d_alloc_parallel() and we have an explicit d_lookup_done() on the
resulting in-lookup dentry before we leave that frame.

One of those exceptions is nfs_call_unlink(), where wq is embedded into
(dynamically allocated) struct nfs_unlinkdata.  It is destroyed in
nfs_async_unlink_release() after an explicit d_lookup_done() on the
dentry wq went into.

Remaining exception is d_add_ci(). There wq is what we'd found in
->d_wait of d_add_ci() argument. Callers of d_add_ci() are two
instances of ->d_lookup() and they must have been given an in-lookup
dentry.  Which means that they'd been called by __lookup_slow() or
lookup_open(), with wq in the call frame of one of those.

Result of d_alloc_parallel() in d_add_ci() is fed to
d_splice_alias(), which either returns non-NULL (and d_add_ci() does
d_lookup_done()) or feeds dentry to __d_add() that will do
__d_lookup_done() under ->d_lock.  That concludes the analysis.

Let __d_lookup_unhash():

  1) Lock the lookup hash and clear DCACHE_PAR_LOOKUP
  2) Unhash the dentry
  3) Retrieve and clear dentry::d_wait
  4) Unlock the hash and return the retrieved waitqueue head pointer
  5) Let the caller handle the wake up.
  6) Rename __d_lookup_done() to __d_lookup_unhash_wake() to enforce
     build failures for OOT code that used __d_lookup_done() and is not
     aware of the new return value.

This does not yet solve the PREEMPT_RT problem completely because
preemption is still disabled due to i_dir_seq being held for write. This
will be addressed in subsequent steps.

An alternative solution would be to switch the waitqueue to a simple
waitqueue, but aside of Linus not being a fan of them, moving the wake up
closer to the place where dentry::lock is unlocked reduces lock contention
time for the woken up waiter.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20220613140712.77932-3-bigeasy@linutronix.de
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-30 00:36:10 -04:00
Sebastian Andrzej Siewior
cf634d540a fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT
i_dir_seq is a sequence counter with a lock which is represented by the
lowest bit. The writer atomically updates the counter which ensures that it
can be modified by only one writer at a time. This requires preemption to
be disabled across the write side critical section.

On !PREEMPT_RT kernels this is implicit by the caller acquiring
dentry::lock. On PREEMPT_RT kernels spin_lock() does not disable preemption
which means that a preempting writer or reader would live lock. It's
therefore required to disable preemption explicitly.

An alternative solution would be to replace i_dir_seq with a seqlock_t for
PREEMPT_RT, but that comes with its own set of problems due to arbitrary
lock nesting. A pure sequence count with an associated spinlock is not
possible because the locks held by the caller are not necessarily related.

As the critical section is small, disabling preemption is a sensible
solution.

Reported-by: Oleg.Karfich@wago.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20220613140712.77932-2-bigeasy@linutronix.de
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-30 00:35:51 -04:00
Al Viro
40a3cb0d23 d_add_ci(): make sure we don't miss d_lookup_done()
All callers of d_alloc_parallel() must make sure that resulting
in-lookup dentry (if any) will encounter __d_lookup_done() before
the final dput().  d_add_ci() might end up creating in-lookup
dentries; they are fed to d_splice_alias(), which will normally
make sure they meet __d_lookup_done().  However, it is possible
to end up with d_splice_alias() failing with ERR_PTR(-ELOOP)
without having done so.  It takes a corrupted ntfs or case-insensitive
xfs image, but neither should end up with memory corruption...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-30 00:29:05 -04:00
Christophe JAILLET
45ee6d1e93 ocfs2: fix a typo in a comment
s/heartbaet/heartbeat

Link: https://lkml.kernel.org/r/4d4a6786e8ad522bfad6d2401b7f6634f8af0e5d.1658436259.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:12:36 -07:00
Christophe JAILLET
702f3cf374 ocfs2: use the bitmap API to simplify code
Use bitmap_zero() instead of hand-writing it.  It is less verbose.

While at it, add an explicit #include <linux/bitmap.h>.

Link: https://lkml.kernel.org/r/86d2a027c319db12055c98f00c65f7d01e703722.1658436259.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:12:36 -07:00
Christophe JAILLET
97d3b2676f ocfs2: remove some useless functions
Patch series "ocfs2: A few clean_ups", v2.

__ocfs2_node_map_set_bit() and __ocfs2_node_map_clear_bit() are just
wrapper around set_bit() and clear_bit().

The leading __ also makes think that these functions are non-atomic just
like __set_bit() and __clear_bit().

So, just remove these wrappers and call set_bit() and clear_bit()
directly.

Link: https://lkml.kernel.org/r/cover.1658436259.git.christophe.jaillet@wanadoo.fr
Link: https://lkml.kernel.org/r/bd1429c84ec7d174c96dbb67a2b42b1b456d9394.1658436259.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:12:35 -07:00
Alexey Dobriyan
ed8fb78d7e proc: add some (hopefully) insightful comments
* /proc/${pid}/net status
* removing PDE vs last close stuff (again!)
* random small stuff

Link: https://lkml.kernel.org/r/YtwrM6sDC0OQ53YB@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:12:35 -07:00
Phillip Lougher
b09a7a036d squashfs: support reading fragments in readahead call
Add a function which can be used to read fragments in the readahead call.

This function is necessary because filesystems built with the -tailends
(or -always-use-fragments) option may have fragments present which cannot
be currently handled.

Link: https://lkml.kernel.org/r/20220617083810.337573-5-hsinyi@chromium.org
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Hou Tao <houtao1@huawei.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Xiongwei Song <Xiongwei.Song@windriver.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Zheng Liang <zhengliang6@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:12:34 -07:00
Hsin-Yi Wang
8fc78b6fe2 squashfs: implement readahead
Implement readahead callback for squashfs.  It will read datablocks which
cover pages in readahead request.  For a few cases it will not mark page
as uptodate, including:

- file end is 0.
- zero filled blocks.
- current batch of pages isn't in the same datablock.
- decompressor error.

Otherwise pages will be marked as uptodate.  The unhandled pages will be
updated by readpage later.

Link: https://lkml.kernel.org/r/20220617083810.337573-4-hsinyi@chromium.org
Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reported-by: Matthew Wilcox <willy@infradead.org>
Reported-by: Phillip Lougher <phillip@squashfs.org.uk>
Reported-by: Xiongwei Song <Xiongwei.Song@windriver.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hou Tao <houtao1@huawei.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Zheng Liang <zhengliang6@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:12:34 -07:00
Phillip Lougher
db98b43086 squashfs: always build "file direct" version of page actor
Squashfs_readahead uses the "file direct" version of the page actor, and
so build it unconditionally.

Link: https://lkml.kernel.org/r/20220617083810.337573-3-hsinyi@chromium.org
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Hou Tao <houtao1@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Xiongwei Song <Xiongwei.Song@windriver.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Zheng Liang <zhengliang6@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:12:34 -07:00
Hsin-Yi Wang
0c12185728 Revert "squashfs: provide backing_dev_info in order to disable read-ahead"
Patch series "Implement readahead for squashfs", v7.

Commit 9eec1d897139("squashfs: provide backing_dev_info in order to
disable read-ahead") mitigates the performance drop issue for squashfs by
closing readahead for it.

This series implements readahead callback for squashfs.


This patch (of 4):

This reverts 9eec1d897139e5 ("squashfs: provide backing_dev_info in order
to disable read-ahead").

Revert closing the readahead to squashfs since the readahead callback for
squashfs is implemented.

Link: https://lkml.kernel.org/r/20220617083810.337573-1-hsinyi@chromium.org
Link: https://lkml.kernel.org/r/20220617083810.337573-2-hsinyi@chromium.org
Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org>
Suggested-by: Xiongwei Song <Xiongwei.Song@windriver.com>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Zheng Liang <zhengliang6@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Hou Tao <houtao1@huawei.com>
Cc: Miao Xie <miaoxie@huawei.com>

Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:12:34 -07:00
Miaohe Lin
1168076345 hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
In some cases, e.g.  when size option is not specified, f_blocks, f_bavail
and f_bfree will be set to -1 instead of 0.  Likewise, when nr_inodes
isn't specified, f_files and f_ffree will be set to -1 too.  Update the
comment to make this clear.

Link: https://lkml.kernel.org/r/20220726142918.51693-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:19 -07:00
Miaohe Lin
445c809829 hugetlbfs: cleanup some comments in inode.c
The function generic_file_buffered_read has been renamed to filemap_read
since commit 87fa0f3eb267 ("mm/filemap: rename generic_file_buffered_read
to filemap_read").  Update the corresponding comment.  And duplicated
taken in hugetlbfs_fill_super is removed.

Link: https://lkml.kernel.org/r/20220726142918.51693-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:19 -07:00
Miaohe Lin
990e52b17d hugetlbfs: remove unneeded header file
The header file signal.h is unneeded now. Remove it.

Link: https://lkml.kernel.org/r/20220726142918.51693-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:19 -07:00
Miaohe Lin
7ec3c362cf hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
The forward declaration for hugetlbfs_ops is unnecessary.  Remove it.

Link: https://lkml.kernel.org/r/20220726142918.51693-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:19 -07:00
Miaohe Lin
d00365175e hugetlbfs: use helper macro SZ_1{K,M}
Patch series "A few cleanup and fixup patches for hugetlbfs", v2.

This series contains a few cleaup patches to remove unneeded forward
declaration, use helper macro and so on.  More details can be found in the
respective changelogs.


This patch (of 5):

Use helper macro SZ_1K and SZ_1M to do the size conversion.  Minor
readability improvement.

Link: https://lkml.kernel.org/r/20220726142918.51693-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220726142918.51693-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:19 -07:00
Shiyang Ruan
35fcd75af3 xfs: fail dax mount if reflink is enabled on a partition
Failure notification is not supported on partitions.  So, when we mount a
reflink enabled xfs on a partition with dax option, let it fail with
-EINVAL code.

Link: https://lkml.kernel.org/r/20220609143435.393724-1-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:17 -07:00
Axel Rasmussen
914eedcb9b userfaultfd: don't fail on unrecognized features
The basic interaction for setting up a userfaultfd is, userspace issues
a UFFDIO_API ioctl, and passes in a set of zero or more feature flags,
indicating the features they would prefer to use.

Of course, different kernels may support different sets of features
(depending on kernel version, kconfig options, architecture, etc).
Userspace's expectations may also not match: perhaps it was built
against newer kernel headers, which defined some features the kernel
it's running on doesn't support.

Currently, if userspace passes in a flag we don't recognize, the
initialization fails and we return -EINVAL. This isn't great, though.
Userspace doesn't have an obvious way to react to this; sure, one of the
features I asked for was unavailable, but which one? The only option it
has is to turn off things "at random" and hope something works.

Instead, modify UFFDIO_API to just ignore any unrecognized feature
flags. The interaction is now that the initialization will succeed, and
as always we return the *subset* of feature flags that can actually be
used back to userspace.

Now userspace has an obvious way to react: it checks if any flags it
asked for are missing. If so, it can conclude this kernel doesn't
support those, and it can either resign itself to not using them, or
fail with an error on its own, or whatever else.

Link: https://lkml.kernel.org/r/20220722201513.1624158-1-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:17 -07:00
Jiachen Zhang
dd524b7f31 ovl: drop WARN_ON() dentry is NULL in ovl_encode_fh()
Some code paths cannot guarantee the inode have any dentry alias. So
WARN_ON() all !dentry may flood the kernel logs.

For example, when an overlayfs inode is watched by inotifywait (1), and
someone is trying to read the /proc/$(pidof inotifywait)/fdinfo/INOTIFY_FD,
at that time if the dentry has been reclaimed by kernel (such as
echo 2 > /proc/sys/vm/drop_caches), there will be a WARN_ON(). The
printed call stack would be like:

    ? show_mark_fhandle+0xf0/0xf0
    show_mark_fhandle+0x4a/0xf0
    ? show_mark_fhandle+0xf0/0xf0
    ? seq_vprintf+0x30/0x50
    ? seq_printf+0x53/0x70
    ? show_mark_fhandle+0xf0/0xf0
    inotify_fdinfo+0x70/0x90
    show_fdinfo.isra.4+0x53/0x70
    seq_show+0x130/0x170
    seq_read+0x153/0x440
    vfs_read+0x94/0x150
    ksys_read+0x5f/0xe0
    do_syscall_64+0x59/0x1e0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

So let's drop WARN_ON() to avoid kernel log flooding.

Reported-by: Hongbo Yin <yinhongbo@bytedance.com>
Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Signed-off-by: Tianci Zhang <zhangtianci.1997@bytedance.com>
Fixes: 8ed5eec9d6c4 ("ovl: encode pure upper file handles")
Cc: <stable@vger.kernel.org> # v4.16
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-28 15:00:57 +02:00
Yang Xu
ded536561a ovl: improve ovl_get_acl() if POSIX ACL support is off
Provide a proper stub for the !CONFIG_FS_POSIX_ACL case.

Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-28 13:24:51 +02:00
Slark Xiao
3fe4076482 kernfs: Fix typo 'the the' in comment
Replace 'the the' with 'the' in the comment.

Signed-off-by: Slark Xiao <slark_xiao@163.com>
Link: https://lore.kernel.org/r/20220722100518.79741-1-slark_xiao@163.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-28 10:57:25 +02:00
Fabio M. De Francesco
c6e8e36c6a exec: Call kmap_local_page() in copy_string_kernel()
The use of kmap_atomic() is being deprecated in favor of kmap_local_page().

With kmap_local_page(), the mappings are per thread, CPU local and not
globally visible. Furthermore, the mappings can be acquired from any
context (including interrupts).

Therefore, replace kmap_atomic() with kmap_local_page() in
copy_string_kernel(). Instead of open-coding local mapping + memcpy(),
use memcpy_to_page(). Delete a redundant call to flush_dcache_page().

Tested with xfstests on a QEMU/ KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.

Suggested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220724212523.13317-1-fmdefrancesco@gmail.com
2022-07-27 14:15:09 -07:00
Yang Li
9c5dd8034e ovl: fix some kernel-doc comments
Remove warnings found by running scripts/kernel-doc,
which is caused by using 'make W=1'.
fs/overlayfs/super.c:311: warning: Function parameter or member 'dentry'
not described in 'ovl_statfs'
fs/overlayfs/super.c:311: warning: Excess function parameter 'sb'
description in 'ovl_statfs'
fs/overlayfs/super.c:357: warning: Function parameter or member 'm' not
described in 'ovl_show_options'
fs/overlayfs/super.c:357: warning: Function parameter or member 'dentry'
not described in 'ovl_show_options'

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-27 16:31:31 +02:00
Miklos Szeredi
b10b85fe51 ovl: warn if trusted xattr creation fails
When mounting overlayfs in an unprivileged user namespace, trusted xattr
creation will fail.  This will lead to failures in some file operations,
e.g. in the following situation:

  mkdir lower upper work merged
  mkdir lower/directory
  mount -toverlay -olowerdir=lower,upperdir=upper,workdir=work none merged
  rmdir merged/directory
  mkdir merged/directory

The last mkdir will fail:

  mkdir: cannot create directory 'merged/directory': Input/output error

The cause for these failures is currently extremely non-obvious and hard to
debug.  Hence, warn the user and suggest using the userxattr mount option,
if it is not already supplied and xattr creation fails during the
self-check.

Reported-by: Alois Wohlschlager <alois1@gmx-topmail.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-27 16:31:30 +02:00
Daniil Lunev
247861c325 fuse: retire block-device-based superblock on force unmount
Force unmount of FUSE severes the connection with the user space, even if
there are still open files.  Subsequent remount tries to re-use the
superblock held by the open files, which is meaningless in the FUSE case
after disconnect - reused super block doesn't have userspace counterpart
attached to it and is incapable of doing any IO.

This patch adds the functionality only for the block-device-based supers,
since the primary use case of the feature is to gracefully handle force
unmount of external devices, mounted with FUSE.  This can be further
extended to cover all superblocks, if the need arises.

Signed-off-by: Daniil Lunev <dlunev@chromium.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-27 11:30:31 +02:00
Daniil Lunev
04b9407197 vfs: function to prevent re-use of block-device-based superblocks
The function is to be called from filesystem-specific code to mark a
superblock to be ignored by superblock test and thus never re-used.  The
function also unregisters bdi if the bdi is per-superblock to avoid
collision if a new superblock is created to represent the filesystem.
generic_shutdown_super() skips unregistering bdi for a retired superlock as
it assumes retire function has already done it.

This patch adds the functionality only for the block-device-based supers,
since the primary use case of the feature is to gracefully handle force
unmount of external devices, mounted with FUSE.  This can be further
extended to cover all superblocks, if the need arises.

Signed-off-by: Daniil Lunev <dlunev@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-27 11:30:30 +02:00
Linus Torvalds
39c3c396f8 Thirteen hotfixes, Eight are cc:stable and the remainder are for post-5.18
issues or are too minor to warrant backporting
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuCV7gAKCRDdBJ7gKXxA
 jrK2AQDeoayQKXJFTcEltKAUTooXM/BoRf+O3ti/xrSWpwta8wEAjaBIJ8e7UlCj
 g+p6u/pd38f226ldzI5w3bIBSPCbnwU=
 =3rO0
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2022-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "Thirteen hotfixes.

  Eight are cc:stable and the remainder are for post-5.18 issues or are
  too minor to warrant backporting"

* tag 'mm-hotfixes-stable-2022-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mailmap: update Gao Xiang's email addresses
  userfaultfd: provide properly masked address for huge-pages
  Revert "ocfs2: mount shared volume without ha stack"
  hugetlb: fix memoryleak in hugetlb_mcopy_atomic_pte
  fs: sendfile handles O_NONBLOCK of out_fd
  ntfs: fix use-after-free in ntfs_ucsncmp()
  secretmem: fix unhandled fault in truncate
  mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range()
  mm: fix missing wake-up event for FSDAX pages
  mm: fix page leak with multiple threads mapping the same page
  mailmap: update Seth Forshee's email address
  tmpfs: fix the issue that the mount and remount results are inconsistent.
  mm: kfence: apply kmemleak_ignore_phys on early allocated pool
2022-07-26 19:38:46 -07:00
Nadav Amit
d172b1a3bd userfaultfd: provide properly masked address for huge-pages
Commit 824ddc601adc ("userfaultfd: provide unmasked address on
page-fault") was introduced to fix an old bug, in which the offset in the
address of a page-fault was masked.  Concerns were raised - although were
never backed by actual code - that some userspace code might break because
the bug has been around for quite a while.  To address these concerns a
new flag was introduced, and only when this flag is set by the user,
userfaultfd provides the exact address of the page-fault.

The commit however had a bug, and if the flag is unset, the offset was
always masked based on a base-page granularity.  Yet, for huge-pages, the
behavior prior to the commit was that the address is masked to the
huge-page granulrity.

While there are no reports on real breakage, fix this issue.  If the flag
is unset, use the address with the masking that was done before.

Link: https://lkml.kernel.org/r/20220711165906.2682-1-namit@vmware.com
Fixes: 824ddc601adc ("userfaultfd: provide unmasked address on page-fault")
Signed-off-by: Nadav Amit <namit@vmware.com>
Reported-by: James Houghton <jthoughton@google.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: James Houghton <jthoughton@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-26 18:25:01 -07:00
Xin Gao
feee1ce45a fsnotify: Fix comment typo
The double `if' is duplicated in line 104, remove one.

Signed-off-by: Xin Gao <gaoxin@cdjrlc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220722194639.18545-1-gaoxin@cdjrlc.com
2022-07-26 13:38:47 +02:00
Jan Kara
fa78f33693 ext2: Add more validity checks for inode counts
Add checks verifying number of inodes stored in the superblock matches
the number computed from number of inodes per group. Also verify we have
at least one block worth of inodes per group. This prevents crashes on
corrupted filesystems.

Reported-by: syzbot+d273f7d7f58afd93be48@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
2022-07-26 13:24:04 +02:00
Zeng Jingxiang
bd6e21a904 fs/reiserfs/inode: remove dead code in _get_block_create_0()
Since commit 27b3a5c51b50 ("kill-the-bkl/reiserfs: drop the fs race
watchdog from _get_block_create_0()"), which removed a label that may
have the pointer 'p' touched in its control flow, related if statements
now eval to constant value now. Just remove them.

Assigning value NULL to p here
293     char *p = NULL;

In the following conditional expression, the value of p is always NULL,
As a result, the kunmap() cannot be executed.
308	if (p)
309		kunmap(bh_result->b_page);

355	if (p)
356		kunmap(bh_result->b_page);

366	if (p)
367		kunmap(bh_result->b_page);

Also, the kmap() cannot be executed.
399	if (!p)
400		p = (char *)kmap(bh_result->b_page);

[JK: Removed unnecessary initialization of 'p' to NULL]

Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220720083029.1065578-1-zengjx95@gmail.com
2022-07-26 12:46:36 +02:00
Deming Wang
73fb2c8b61 virtio_fs: Modify format for virtio_fs_direct_access
We should isolate operators with spaces.

Signed-off-by: Deming Wang <wangdeming@inspur.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-26 10:38:58 +02:00
Christoph Hellwig
0b078d9db8 btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_read
This flag was used to communicate that the low-level compression code
already did verify the checksum to the high-level I/O completion code.

But it has been unused for a long time as the upper btrfs_bio for the
decompressed data had a NULL csum pointer basically since that pointer
existed and the code already checks for that a little later.

Note that this does not affect the other use of the checked flag, which
is only used for the COW fixup worker.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 19:56:16 +02:00
Christoph Hellwig
81bd9328ab btrfs: fix repair of compressed extents
Currently the checksum of compressed extents is verified based on the
compressed data and the lower btrfs_bio, but the actual repair process
is driven by end_bio_extent_readpage on the upper btrfs_bio for the
decompressed data.

This has a bunch of issues, including not being able to properly
communicate the failed mirror up in case that the I/O submission got
preempted, a general loss of if an error was an I/O error or a checksum
verification failure, but most importantly that this design causes
btrfs_clean_io_failure to eventually write back the uncompressed good
data onto the disk sectors that are supposed to contain compressed data.

Fix this by moving the repair to the lower btrfs_bio.  To do so, a fair
amount of code has to be reshuffled:

 a) the lower btrfs_bio now needs a valid csum pointer.  The easiest way
    to achieve that is to pass NULL btrfs_lookup_bio_sums and just use
    the btrfs_bio management of csums.  For a compressed_bio that is
    split into multiple btrfs_bios this means additional memory
    allocations, but the code becomes a lot more regular.
 b) checksum verification now runs directly on the lower btrfs_bio instead
    of the compressed_bio.  This actually nicely simplifies the end I/O
    processing.
 c) btrfs_repair_one_sector can't just look up the logical address for
    the file offset any more, as there is no corresponding relative
    offsets that apply to the file offset and the logic address for
    compressed extents.  Instead require that the saved bvec_iter in the
    btrfs_bio is filled out for all read bios and use that, which again
    removes a fair amount of code.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 19:56:16 +02:00
Christoph Hellwig
7959bd4411 btrfs: remove the start argument to check_data_csum and export
Derive the value of start from the btrfs_bio now that ->file_offset is
always valid.  Also export and rename the function so it's available
outside of inode.c as we'll need that soon.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 19:55:32 +02:00
Christoph Hellwig
7aa51232e2 btrfs: pass a btrfs_bio to btrfs_repair_one_sector
Pass the btrfs_bio instead of the plain bio to btrfs_repair_one_sector,
and remove the start and failed_mirror arguments in favor of deriving
them from the btrfs_bio.  For this to work ensure that the file_offset
field is also initialized for buffered I/O.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 19:55:19 +02:00
Christoph Hellwig
524bcd1e17 btrfs: simplify the pending I/O counting in struct compressed_bio
Instead of counting the sectors just count the bios, with an extra
reference held during submission.  This significantly simplifies the
submission side error handling.

This slightly changes completion and error handling of
btrfs_submit_compressed_{read,write} because with the old code the
compressed_bio could have been completed in
submit_compressed_{read,write} only if there was an error during
submission for one of the lower bio, whilst with the new code there is a
chance for this to happen even for successful submission if the all the
lower bios complete before the end of the function is reached.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 19:54:47 +02:00