84057 Commits

Author SHA1 Message Date
Zhang Yi
d9a4549601 jbd2: drop useless error tag in jbd2_journal_wipe()
no_recovery is redundant, just drop it.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230811063610.2980059-10-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15 09:00:34 -04:00
Zhang Yi
49887e47a5 jbd2: cleanup journal_init_common()
Adjust the initialization sequence and error handle of journal_t, moving
load superblock to the begin, and classify others initialization.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230811063610.2980059-9-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15 09:00:33 -04:00
Zhang Yi
0dbc759ae9 jbd2: add fast_commit space check
If JBD2_FEATURE_INCOMPAT_FAST_COMMIT bit is set, it means the journal
have fast commit records need to recover, so the fast commit size
should not be too large, and the leftover normal journal size should
never less than JBD2_MIN_JOURNAL_BLOCKS. If it happens, the
journal->j_last is likely to be wrong and will probably lead to
incorrect journal recovery. So add a check into the
journal_check_superblock(), and drop the pointless check when
initializing the fastcommit parameters.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230811063610.2980059-8-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15 09:00:33 -04:00
Zhang Yi
054d9c8fef jbd2: cleanup load_superblock()
Rename load_superblock() to journal_load_superblock(), move getting and
reading superblock from journal_init_common() and
journal_get_superblock() to this function, and also rename
journal_get_superblock() to journal_check_superblock(), make it a pure
check helper to check superblock validity from disk.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230811063610.2980059-7-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15 09:00:33 -04:00
Zhang Yi
18dad509e7 jbd2: open code jbd2_verify_csum_type() helper
jbd2_verify_csum_type() helper check checksum type in the superblock for
v2 or v3 checksum feature, it always return true if these features are
not enabled, and it has only one user, so open code it is more clear.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230811063610.2980059-6-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15 09:00:33 -04:00
Zhang Yi
e4adf8b837 jbd2: checking valid features early in journal_get_superblock()
journal_get_superblock() is used to check validity of the jounal
supberblock, so move the features checks from jbd2_journal_load() to
journal_get_superblock().

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230811063610.2980059-5-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15 09:00:33 -04:00
Zhang Yi
9600f3e5cf jbd2: don't load superblock in jbd2_journal_check_used_features()
Since load_superblock() has been moved to journal_init_common(), the
in-memory superblock structure is initialized and contains valid data
once the file system has a journal_t object, so it's safe to access it,
let's drop the call to journal_get_superblock() from
jbd2_journal_check_used_features() and also drop the setting/clearing of
the veirfy bit of the superblock buffer.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230811063610.2980059-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15 09:00:33 -04:00
Zhang Yi
c30713084b jbd2: move load_superblock() into journal_init_common()
Move the call to load_superblock() from jbd2_journal_load() and
jbd2_journal_wipe() early into journal_init_common(), the journal
superblock gets read and the in-memory journal_t structure gets
initialised after calling jbd2_journal_init_{dev,inode}, it's safe to
do following initialization according to it.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230811063610.2980059-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15 09:00:33 -04:00
Zhang Yi
29a511e49f jbd2: move load_superblock() dependent functions
Move load_superblock() declaration and the functions it calls before
journal_init_common(). This is a preparation for moving a call to
load_superblock() from jbd2_journal_load() and jbd2_journal_wipe() to
journal_init_common(). No functional changes.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230811063610.2980059-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15 09:00:33 -04:00
Yang Li
89cbd4c036 fs: Fix one kernel-doc comment
Fix one kernel-doc comment to silence the warning:

fs/read_write.c:88: warning: Function parameter or member 'maxsize' not described in 'generic_file_llseek_size'

Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Message-Id: <20230811014359.4960-1-yang.lee@linux.alibaba.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-15 08:32:45 +02:00
Marcelo Tosatti
8a237adf21 fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs
For certain types of applications (for example PLC software or
RAN processing), upon occurrence of an event, it is necessary to
complete a certain task in a maximum amount of time (deadline).

One way to express this requirement is with a pair of numbers,
deadline time and execution time, where:

        * deadline time: length of time between event and deadline.
        * execution time: length of time it takes for processing of event
                          to occur on a particular hardware platform
                          (uninterrupted).

The particular values depend on use-case. For the case
where the realtime application executes in a virtualized
guest, an IPI which must be serviced in the host will cause
the following sequence of events:

        1) VM-exit
        2) execution of IPI (and function call)
        3) VM-entry

Which causes an excess of 50us latency as observed by cyclictest
(this violates the latency requirement of vRAN application with 1ms TTI,
for example).

invalidate_bh_lrus calls an IPI on each CPU that has non empty
per-CPU cache:

        on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1);

The performance when using the per-CPU LRU cache is as follows:

 42 ns per __find_get_block
 68 ns per __find_get_block_slow

Given that the main use cases for latency sensitive applications
do not involve block I/O (data necessary for program operation is
locked in RAM), disable per-CPU buffer_head caches for isolated CPUs.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Message-Id: <ZJtBrybavtb1x45V@tpad>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-15 08:32:44 +02:00
David Howells
d80a8f1b58 vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing
When NFS superblocks are created by automounting, their LSM parameters
aren't set in the fs_context struct prior to sget_fc() being called,
leading to failure to match existing superblocks.

This bug leads to messages like the following appearing in dmesg when
fscache is enabled:

    NFS: Cache volume key already in use (nfs,4.2,2,108,106a8c0,1,,,,100000,100000,2ee,3a98,1d4c,3a98,1)

Fix this by adding a new LSM hook to load fc->security for submount
creation.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/165962680944.3334508.6610023900349142034.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165962729225.3357250.14350728846471527137.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/165970659095.2812394.6868894171102318796.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/166133579016.3678898.6283195019480567275.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/217595.1662033775@warthog.procyon.org.uk/ # v5
Fixes: 9bc61ab18b1d ("vfs: Introduce fs_context, switch vfs_kern_mount() to it.")
Fixes: 779df6a5480f ("NFS: Ensure security label is set for root inode")
Tested-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: "Christian Brauner (Microsoft)" <brauner@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230808-master-v9-1-e0ecde888221@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-15 08:32:30 +02:00
Zhiguo Niu
0cc81b1ad5 f2fs: should update REQ_TIME for direct write
The sending interval of discard and GC should also
consider direct write requests; filesystem is not
idle if there is direct write.

Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-08-14 13:42:28 -07:00
Chao Yu
eb61c2cca2 f2fs: fix to account cp stats correctly
cp_foreground_calls sysfs entry shows total CP call count rather than
foreground CP call count, fix it.

Fixes: fc7100ea2a52 ("f2fs: Add f2fs stats to sysfs")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-08-14 13:42:05 -07:00
Chao Yu
9bf1dcbdfd f2fs: fix to account gc stats correctly
As reported, status debugfs entry shows inconsistent GC stats as below:

GC calls: 6008 (BG: 6161)
  - data segments : 3053 (BG: 3053)
  - node segments : 2955 (BG: 2955)

Total GC calls is larger than BGGC calls, the reason is:
- f2fs_stat_info.call_count accounts total migrated section count
by f2fs_gc()
- f2fs_stat_info.bg_gc accounts total call times of f2fs_gc() from
background gc_thread

Another issue is gc_foreground_calls sysfs entry shows total GC call
count rather than FGGC call count.

This patch changes as below for fix:
- account GC calls and migrated segment count separately
- support to account migrated section count if it enables large section
mode
- fix to show correct value in gc_foreground_calls sysfs entry

Fixes: fc7100ea2a52 ("f2fs: Add f2fs stats to sysfs")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-08-14 13:41:57 -07:00
Chao Yu
bc3994ffa4 f2fs: remove unneeded check condition in __f2fs_setxattr()
It has checked return value of write_all_xattrs(), remove unneeded
following check condition.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-08-14 13:41:10 -07:00
Chao Yu
8874ad7dae f2fs: fix to update i_ctime in __f2fs_setxattr()
generic/728       - output mismatch (see /media/fstests/results//generic/728.out.bad)
    --- tests/generic/728.out	2023-07-19 07:10:48.362711407 +0000
    +++ /media/fstests/results//generic/728.out.bad	2023-07-19 08:39:57.000000000 +0000
     QA output created by 728
    +Expected ctime to change after setxattr.
    +Expected ctime to change after removexattr.
     Silence is golden
    ...
    (Run 'diff -u /media/fstests/tests/generic/728.out /media/fstests/results//generic/728.out.bad'  to see the entire diff)
generic/729        1s

It needs to update i_ctime after {set,remove}xattr, fix it.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-08-14 13:41:09 -07:00
Chao Yu
958ccbbf1c Revert "f2fs: fix to do sanity check on extent cache correctly"
syzbot reports a f2fs bug as below:

UBSAN: array-index-out-of-bounds in fs/f2fs/f2fs.h:3275:19
index 1409 is out of range for type '__le32[923]' (aka 'unsigned int[923]')
Call Trace:
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
 ubsan_epilogue lib/ubsan.c:217 [inline]
 __ubsan_handle_out_of_bounds+0x11c/0x150 lib/ubsan.c:348
 inline_data_addr fs/f2fs/f2fs.h:3275 [inline]
 __recover_inline_status fs/f2fs/inode.c:113 [inline]
 do_read_inode fs/f2fs/inode.c:480 [inline]
 f2fs_iget+0x4730/0x48b0 fs/f2fs/inode.c:604
 f2fs_fill_super+0x640e/0x80c0 fs/f2fs/super.c:4601
 mount_bdev+0x276/0x3b0 fs/super.c:1391
 legacy_get_tree+0xef/0x190 fs/fs_context.c:611
 vfs_get_tree+0x8c/0x270 fs/super.c:1519
 do_new_mount+0x28f/0xae0 fs/namespace.c:3335
 do_mount fs/namespace.c:3675 [inline]
 __do_sys_mount fs/namespace.c:3884 [inline]
 __se_sys_mount+0x2d9/0x3c0 fs/namespace.c:3861
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

The issue was bisected to:

commit d48a7b3a72f121655d95b5157c32c7d555e44c05
Author: Chao Yu <chao@kernel.org>
Date:   Mon Jan 9 03:49:20 2023 +0000

    f2fs: fix to do sanity check on extent cache correctly

The root cause is we applied both v1 and v2 of the patch, v2 is the right
fix, so it needs to revert v1 in order to fix reported issue.

v1:
commit d48a7b3a72f1 ("f2fs: fix to do sanity check on extent cache correctly")
https://lore.kernel.org/lkml/20230109034920.492914-1-chao@kernel.org/

v2:
commit 269d11948100 ("f2fs: fix to do sanity check on extent cache correctly")
https://lore.kernel.org/lkml/20230207134808.1827869-1-chao@kernel.org/

Reported-by: syzbot+601018296973a481f302@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-f2fs-devel/000000000000fcf0690600e4d04d@google.com/
Fixes: d48a7b3a72f1 ("f2fs: fix to do sanity check on extent cache correctly")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-08-14 13:41:09 -07:00
Minjie Du
a842a90926 f2fs: increase usage of folio_next_index() helper
Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using
the existing helper folio_next_index().

Signed-off-by: Minjie Du <duminjie@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-08-14 13:41:09 -07:00
Chunhai Guo
2bd4df8fcb f2fs: Only lfs mode is allowed with zoned block device feature
Now f2fs support four block allocation modes: lfs, adaptive,
fragment:segment, fragment:block. Only lfs mode is allowed with zoned block
device feature.

Fixes: 6691d940b0e0 ("f2fs: introduce fragment allocation mode mount option")
Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-08-14 13:41:09 -07:00
Shin'ichiro Kawasaki
3cb88bc159 f2fs: check zone type before sending async reset zone command
The commit 25f9080576b9 ("f2fs: add async reset zone command support")
introduced "async reset zone commands" by calling
__submit_zone_reset_cmd() in async discard operations. However,
__submit_zone_reset_cmd() is called regardless of zone type of discard
target zone. When devices have conventional zones, zone reset commands
are sent to the conventional zones and cause I/O errors.

Avoid the I/O errors by checking that the discard target zone type is
sequential write required. If not, handle the discard operation in same
manner as non-zoned, regular block devices. For that purpose, add a new
helper function f2fs_bdev_index() which gets index of the zone reset
target device.

Fixes: 25f9080576b9 ("f2fs: add async reset zone command support")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-08-14 13:41:09 -07:00
Chao Yu
025b3602b5 f2fs: compress: don't {,de}compress non-full cluster
f2fs won't compress non-full cluster in tail of file, let's skip
dirtying and rewrite such cluster during f2fs_ioc_{,de}compress_file.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-08-14 13:41:08 -07:00
Chao Yu
3a2c0e55f9 f2fs: allow f2fs_ioc_{,de}compress_file to be interrupted
This patch allows f2fs_ioc_{,de}compress_file() to be interrupted, so that,
userspace won't be blocked when manual {,de}compression on large file is
interrupted by signal.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-08-14 13:41:08 -07:00
Christoph Hellwig
51bf8d3c81 f2fs: don't reopen the main block device in f2fs_scan_devices
f2fs_scan_devices reopens the main device since the very beginning, which
has always been useless, and also means that we don't pass the right
holder for the reopen, which now leads to a warning as the core super.c
holder ops aren't passed in for the reopen.

Fixes: 3c62be17d4f5 ("f2fs: support multiple devices")
Fixes: 0718afd47f70 ("block: introduce holder ops")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-08-14 13:41:08 -07:00
Chao Yu
b5ab3276eb f2fs: fix to avoid mmap vs set_compress_option case
Compression option in inode should not be changed after they have
been used, however, it may happen in below race case:

Thread A				Thread B
- f2fs_ioc_set_compress_option
 - check f2fs_is_mmap_file()
 - check get_dirty_pages()
 - check F2FS_HAS_BLOCKS()
					- f2fs_file_mmap
					 - set_inode_flag(FI_MMAP_FILE)
					- fault
					 - do_page_mkwrite
					  - f2fs_vm_page_mkwrite
					  - f2fs_get_block_locked
					 - fault_dirty_shared_page
					  - set_page_dirty
 - update i_compress_algorithm
 - update i_log_cluster_size
 - update i_cluster_size

Avoid such race condition by covering f2fs_file_mmap() w/ i_sem lock,
meanwhile add mmap file check condition in f2fs_may_compress() as well.

Fixes: e1e8debec656 ("f2fs: add F2FS_IOC_SET_COMPRESS_OPTION ioctl")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-08-14 13:41:07 -07:00
Jaegeuk Kim
d2d9bb3b6d f2fs: get out of a repeat loop when getting a locked data page
https://bugzilla.kernel.org/show_bug.cgi?id=216050

Somehow we're getting a page which has a different mapping.
Let's avoid the infinite loop.

Cc: <stable@vger.kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-08-14 13:41:07 -07:00
Jaegeuk Kim
a3ab557466 f2fs: flush inode if atomic file is aborted
Let's flush the inode being aborted atomic operation to avoid stale dirty
inode during eviction in this call stack:

  f2fs_mark_inode_dirty_sync+0x22/0x40 [f2fs]
  f2fs_abort_atomic_write+0xc4/0xf0 [f2fs]
  f2fs_evict_inode+0x3f/0x690 [f2fs]
  ? sugov_start+0x140/0x140
  evict+0xc3/0x1c0
  evict_inodes+0x17b/0x210
  generic_shutdown_super+0x32/0x120
  kill_block_super+0x21/0x50
  deactivate_locked_super+0x31/0x90
  cleanup_mnt+0x100/0x160
  task_work_run+0x59/0x90
  do_exit+0x33b/0xa50
  do_group_exit+0x2d/0x80
  __x64_sys_exit_group+0x14/0x20
  do_syscall_64+0x3b/0x90
  entry_SYSCALL_64_after_hwframe+0x63/0xcd

This triggers f2fs_bug_on() in f2fs_evict_inode:
 f2fs_bug_on(sbi, is_inode_flag_set(inode, FI_DIRTY_INODE));

This fixes the syzbot report:

loop0: detected capacity change from 0 to 131072
F2FS-fs (loop0): invalid crc value
F2FS-fs (loop0): Found nat_bits in checkpoint
F2FS-fs (loop0): Mounted with checkpoint version = 48b305e4
------------[ cut here ]------------
kernel BUG at fs/f2fs/inode.c:869!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 5014 Comm: syz-executor220 Not tainted 6.4.0-syzkaller-11479-g6cd06ab12d1a #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
RIP: 0010:f2fs_evict_inode+0x172d/0x1e00 fs/f2fs/inode.c:869
Code: ff df 48 c1 ea 03 80 3c 02 00 0f 85 6a 06 00 00 8b 75 40 ba 01 00 00 00 4c 89 e7 e8 6d ce 06 00 e9 aa fc ff ff e8 63 22 e2 fd <0f> 0b e8 5c 22 e2 fd 48 c7 c0 a8 3a 18 8d 48 ba 00 00 00 00 00 fc
RSP: 0018:ffffc90003a6fa00 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: ffff8880273b8000 RSI: ffffffff83a2bd0d RDI: 0000000000000007
RBP: ffff888077db91b0 R08: 0000000000000007 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff888029a3c000
R13: ffff888077db9660 R14: ffff888029a3c0b8 R15: ffff888077db9c50
FS:  0000000000000000(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1909bb9000 CR3: 00000000276a9000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 evict+0x2ed/0x6b0 fs/inode.c:665
 dispose_list+0x117/0x1e0 fs/inode.c:698
 evict_inodes+0x345/0x440 fs/inode.c:748
 generic_shutdown_super+0xaf/0x480 fs/super.c:478
 kill_block_super+0x64/0xb0 fs/super.c:1417
 kill_f2fs_super+0x2af/0x3c0 fs/f2fs/super.c:4704
 deactivate_locked_super+0x98/0x160 fs/super.c:330
 deactivate_super+0xb1/0xd0 fs/super.c:361
 cleanup_mnt+0x2ae/0x3d0 fs/namespace.c:1254
 task_work_run+0x16f/0x270 kernel/task_work.c:179
 exit_task_work include/linux/task_work.h:38 [inline]
 do_exit+0xa9a/0x29a0 kernel/exit.c:874
 do_group_exit+0xd4/0x2a0 kernel/exit.c:1024
 __do_sys_exit_group kernel/exit.c:1035 [inline]
 __se_sys_exit_group kernel/exit.c:1033 [inline]
 __x64_sys_exit_group+0x3e/0x50 kernel/exit.c:1033
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f309be71a09
Code: Unable to access opcode bytes at 0x7f309be719df.
RSP: 002b:00007fff171df518 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007f309bef7330 RCX: 00007f309be71a09
RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000001
RBP: 0000000000000001 R08: ffffffffffffffc0 R09: 00007f309bef1e40
R10: 0000000000010600 R11: 0000000000000246 R12: 00007f309bef7330
R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:f2fs_evict_inode+0x172d/0x1e00 fs/f2fs/inode.c:869
Code: ff df 48 c1 ea 03 80 3c 02 00 0f 85 6a 06 00 00 8b 75 40 ba 01 00 00 00 4c 89 e7 e8 6d ce 06 00 e9 aa fc ff ff e8 63 22 e2 fd <0f> 0b e8 5c 22 e2 fd 48 c7 c0 a8 3a 18 8d 48 ba 00 00 00 00 00 fc
RSP: 0018:ffffc90003a6fa00 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: ffff8880273b8000 RSI: ffffffff83a2bd0d RDI: 0000000000000007
RBP: ffff888077db91b0 R08: 0000000000000007 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff888029a3c000
R13: ffff888077db9660 R14: ffff888029a3c0b8 R15: ffff888077db9c50
FS:  0000000000000000(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1909bb9000 CR3: 00000000276a9000 CR4: 0000000000350ef0

Cc: <stable@vger.kernel.org>
Reported-and-tested-by: syzbot+e1246909d526a9d470fa@syzkaller.appspotmail.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-08-14 13:41:07 -07:00
Chao Yu
863907a4f5 f2fs: don't handle error case of f2fs_compress_alloc_page()
f2fs_compress_alloc_page() uses mempool to allocate memory, it never
fail, don't handle error case in its callers.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-08-14 13:41:06 -07:00
Jaegeuk Kim
579c7e4150 Revert "f2fs: clean up w/ sbi->log_sectors_per_block"
This reverts commit bfd476623999118d9c509cb0fa9380f2912bc225.

Shinichiro Kawasaki reported:

When I ran workloads on f2fs using v6.5-rcX with fixes [1][2] and a zoned block
devices with 4kb logical block size, I observe mount failure as follows. When
I revert this commit, the failure goes away.

[  167.781975][ T1555] F2FS-fs (dm-0): IO Block Size:        4 KB
[  167.890728][ T1555] F2FS-fs (dm-0): Found nat_bits in checkpoint
[  171.482588][ T1555] F2FS-fs (dm-0): Zone without valid block has non-zero write pointer. Reset the write pointer: wp[0x1300,0x8]
[  171.496000][ T1555] F2FS-fs (dm-0): (0) : Unaligned zone reset attempted (block 280000 + 80000)
[  171.505037][ T1555] F2FS-fs (dm-0): Discard zone failed:  (errno=-5)

The patch replaced "sbi->log_blocksize - SECTOR_SHIFT" with
"sbi->log_sectors_per_block". However, I think these two are not equal when the
device has 4k logical block size. The former uses Linux kernel sector size 512
byte. The latter use 512b sector size or 4kb sector size depending on the
device. mkfs.f2fs obtains logical block size via BLKSSZGET ioctl from the device
and reflects it to the value sbi->log_sector_size_per_block. This causes
unexpected write pointer calculations in check_zone_write_pointer(). This
resulted in unexpected zone reset and the mount failure.

[1] https://lkml.kernel.org/linux-f2fs-devel/20230711050101.GA19128@lst.de/
[2] https://lore.kernel.org/linux-f2fs-devel/20230804091556.2372567-1-shinichiro.kawasaki@wdc.com/

Cc: stable@vger.kernel.org
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: bfd476623999 ("f2fs: clean up w/ sbi->log_sectors_per_block")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-08-14 13:40:27 -07:00
Christian Brauner
22ed7ecdae
fs: add FSCONFIG_CMD_CREATE_EXCL
Summary
=======

This introduces FSCONFIG_CMD_CREATE_EXCL which will allows userspace to
implement something like mount -t ext4 --exclusive /dev/sda /B which
fails if a superblock for the requested filesystem does already exist:

Before this patch
-----------------

$ sudo ./move-mount -f xfs -o source=/dev/sda4 /A
Requesting filesystem type xfs
Mount options requested: source=/dev/sda4
Attaching mount at /A
Moving single attached mount
Setting key(source) with val(/dev/sda4)

$ sudo ./move-mount -f xfs -o source=/dev/sda4 /B
Requesting filesystem type xfs
Mount options requested: source=/dev/sda4
Attaching mount at /B
Moving single attached mount
Setting key(source) with val(/dev/sda4)

After this patch with --exclusive as a switch for FSCONFIG_CMD_CREATE_EXCL
--------------------------------------------------------------------------

$ sudo ./move-mount -f xfs --exclusive -o source=/dev/sda4 /A
Requesting filesystem type xfs
Request exclusive superblock creation
Mount options requested: source=/dev/sda4
Attaching mount at /A
Moving single attached mount
Setting key(source) with val(/dev/sda4)

$ sudo ./move-mount -f xfs --exclusive -o source=/dev/sda4 /B
Requesting filesystem type xfs
Request exclusive superblock creation
Mount options requested: source=/dev/sda4
Attaching mount at /B
Moving single attached mount
Setting key(source) with val(/dev/sda4)
Device or resource busy | move-mount.c: 300: do_fsconfig: i xfs: reusing existing filesystem not allowed

Details
=======

As mentioned on the list (cf. [1]-[3]) mount requests like
mount -t ext4 /dev/sda /A are ambigous for userspace. Either a new
superblock has been created and mounted or an existing superblock has
been reused and a bind-mount has been created.

This becomes clear in the following example where two processes create
the same mount for the same block device:

P1                                                              P2
fd_fs = fsopen("ext4");                                         fd_fs = fsopen("ext4");
fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda");     fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda");
fsconfig(fd_fs, FSCONFIG_SET_STRING, "dax", "always");          fsconfig(fd_fs, FSCONFIG_SET_STRING, "resuid", "1000");

// wins and creates superblock
fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...)
                                                                // finds compatible superblock of P1
                                                                // spins until P1 sets SB_BORN and grabs a reference
                                                                fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...)

fd_mnt1 = fsmount(fd_fs);                                       fd_mnt2 = fsmount(fd_fs);
move_mount(fd_mnt1, "/A")                                       move_mount(fd_mnt2, "/B")

Not just does P2 get a bind-mount but the mount options that P2
requestes are silently ignored. The VFS itself doesn't, can't and
shouldn't enforce filesystem specific mount option compatibility. It
only enforces incompatibility for read-only <-> read-write transitions:

mount -t ext4       /dev/sda /A
mount -t ext4 -o ro /dev/sda /B

The read-only request will fail with EBUSY as the VFS can't just
silently transition a superblock from read-write to read-only or vica
versa without risking security issues.

To userspace this silent superblock reuse can become a security issue in
because there is currently no straightforward way for userspace to know
that they did indeed manage to create a new superblock and didn't just
reuse an existing one.

This adds a new FSCONFIG_CMD_CREATE_EXCL command to fsconfig() that
returns EBUSY if an existing superblock would be reused. Userspace that
needs to be sure that it did create a new superblock with the requested
mount options can request superblock creation using this command. If the
command succeeds they can be sure that they did create a new superblock
with the requested mount options.

This requires the new mount api. With the old mount api it would be
necessary to plumb this through every legacy filesystem's
file_system_type->mount() method. If they want this feature they are
most welcome to switch to the new mount api.

Following is an analysis of the effect of FSCONFIG_CMD_CREATE_EXCL on
each high-level superblock creation helper:

(1) get_tree_nodev()

    Always allocate new superblock. Hence, FSCONFIG_CMD_CREATE and
    FSCONFIG_CMD_CREATE_EXCL are equivalent.

    The binderfs or overlayfs filesystems are examples.

(4) get_tree_keyed()

    Finds an existing superblock based on sb->s_fs_info. Hence,
    FSCONFIG_CMD_CREATE would reuse an existing superblock whereas
    FSCONFIG_CMD_CREATE_EXCL would reject it with EBUSY.

    The mqueue or nfsd filesystems are examples.

(2) get_tree_bdev()

    This effectively works like get_tree_keyed().

    The ext4 or xfs filesystems are examples.

(3) get_tree_single()

    Only one superblock of this filesystem type can ever exist.
    Hence, FSCONFIG_CMD_CREATE would reuse an existing superblock
    whereas FSCONFIG_CMD_CREATE_EXCL would reject it with EBUSY.

    The securityfs or configfs filesystems are examples.

    Note that some single-instance filesystems never destroy the
    superblock once it has been created during the first mount. For
    example, if securityfs has been mounted at least onces then the
    created superblock will never be destroyed again as long as there is
    still an LSM making use it. Consequently, even if securityfs is
    unmounted and the superblock seemingly destroyed it really isn't
    which means that FSCONFIG_CMD_CREATE_EXCL will continue rejecting
    reusing an existing superblock.

    This is acceptable thugh since special purpose filesystems such as
    this shouldn't have a need to use FSCONFIG_CMD_CREATE_EXCL anyway
    and if they do it's probably to make sure that mount options aren't
    ignored.

Following is an analysis of the effect of FSCONFIG_CMD_CREATE_EXCL on
filesystems that make use of the low-level sget_fc() helper directly.
They're all effectively variants on get_tree_keyed(), get_tree_bdev(),
or get_tree_nodev():

(5) mtd_get_sb()

    Similar logic to get_tree_keyed().

(6) afs_get_tree()

    Similar logic to get_tree_keyed().

(7) ceph_get_tree()

    Similar logic to get_tree_keyed().

    Already explicitly allows forcing the allocation of a new superblock
    via CEPH_OPT_NOSHARE. This turns it into get_tree_nodev().

(8) fuse_get_tree_submount()

    Similar logic to get_tree_nodev().

(9) fuse_get_tree()

    Forces reuse of existing FUSE superblock.

    Forces reuse of existing superblock if passed in file refers to an
    existing FUSE connection.
    If FSCONFIG_CMD_CREATE_EXCL is specified together with an fd
    referring to an existing FUSE connections this would cause the
    superblock reusal to fail. If reusing is the intent then
    FSCONFIG_CMD_CREATE_EXCL shouldn't be specified.

(10) fuse_get_tree()
     -> get_tree_nodev()

    Same logic as in get_tree_nodev().

(11) fuse_get_tree()
     -> get_tree_bdev()

    Same logic as in get_tree_bdev().

(12) virtio_fs_get_tree()

     Same logic as get_tree_keyed().

(13) gfs2_meta_get_tree()

     Forces reuse of existing gfs2 superblock.

     Mounting gfs2meta enforces that a gf2s superblock must already
     exist. If not, it will error out. Consequently, mounting gfs2meta
     with FSCONFIG_CMD_CREATE_EXCL would always fail. If reusing is the
     intent then FSCONFIG_CMD_CREATE_EXCL shouldn't be specified.

(14) kernfs_get_tree()

     Similar logic to get_tree_keyed().

(15) nfs_get_tree_common()

    Similar logic to get_tree_keyed().

    Already explicitly allows forcing the allocation of a new superblock
    via NFS_MOUNT_UNSHARED. This effectively turns it into
    get_tree_nodev().

Link: [1] https://lore.kernel.org/linux-block/20230704-fasching-wertarbeit-7c6ffb01c83d@brauner
Link: [2] https://lore.kernel.org/linux-block/20230705-pumpwerk-vielversprechend-a4b1fd947b65@brauner
Link: [3] https://lore.kernel.org/linux-fsdevel/20230725-einnahmen-warnschilder-17779aec0a97@brauner
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Message-Id: <20230802-vfs-super-exclusive-v2-4-95dc4e41b870@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-14 18:48:02 +02:00
Christian Brauner
11a51d8c13
fs: add vfs_cmd_reconfigure()
Split the steps to reconfigure a superblock into a tiny helper instead
of open-coding it in the switch.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Message-Id: <20230802-vfs-super-exclusive-v2-3-95dc4e41b870@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-14 18:48:02 +02:00
Christian Brauner
dae8b08d5d
fs: add vfs_cmd_create()
Split the steps to create a superblock into a tiny helper. This will
make the next patch easier to follow.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Message-Id: <20230802-vfs-super-exclusive-v2-2-95dc4e41b870@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-14 18:48:02 +02:00
Christian Brauner
e062abaec6
super: remove get_tree_single_reconf()
The get_tree_single_reconf() helper isn't used anywhere. Remove it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Message-Id: <20230802-vfs-super-exclusive-v2-1-95dc4e41b870@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-14 18:48:02 +02:00
Miklos Szeredi
91ec6c8559 Revert "fuse: in fuse_flush only wait if someone wants the return code"
This reverts commit 5a8bee63b10f6f2f52f6d22e109a4a147409842a.

Jürg Billeter reports the following regression:

  Since v6.3-rc1 commit 5a8bee63b1 ("fuse: in fuse_flush only wait if
  someone wants the return code") `fput()` is called asynchronously if a
  file is closed as part of a process exiting, i.e., if there was no
  explicit `close()` before exit.

  If the file was open for writing, also `put_write_access()` is called
  asynchronously as part of the async `fput()`.

  If that newly written file is an executable, attempting to `execve()` the
  new file can fail with `ETXTBSY` if it's called after the writer process
  exited but before the async `fput()` has run.

Reported-and-tested-by: "Jürg Billeter" <j@bitron.ch>
Cc: <stable@vger.kernel.org> # v6.3
Link: https://lore.kernel.org/all/4f66cded234462964899f2a661750d6798a57ec0.camel@bitron.ch/
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-08-14 16:23:06 +02:00
Filipe Manana
9b378f6ad4 btrfs: fix infinite directory reads
The readdir implementation currently processes always up to the last index
it finds. This however can result in an infinite loop if the directory has
a large number of entries such that they won't all fit in the given buffer
passed to the readdir callback, that is, dir_emit() returns a non-zero
value. Because in that case readdir() will be called again and if in the
meanwhile new directory entries were added and we still can't put all the
remaining entries in the buffer, we keep repeating this over and over.

The following C program and test script reproduce the problem:

  $ cat /mnt/readdir_prog.c
  #include <sys/types.h>
  #include <dirent.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
    DIR *dir = opendir(".");
    struct dirent *dd;

    while ((dd = readdir(dir))) {
      printf("%s\n", dd->d_name);
      rename(dd->d_name, "TEMPFILE");
      rename("TEMPFILE", dd->d_name);
    }
    closedir(dir);
  }

  $ gcc -o /mnt/readdir_prog /mnt/readdir_prog.c

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f $DEV &> /dev/null
  #mkfs.xfs -f $DEV &> /dev/null
  #mkfs.ext4 -F $DEV &> /dev/null

  mount $DEV $MNT

  mkdir $MNT/testdir
  for ((i = 1; i <= 2000; i++)); do
      echo -n > $MNT/testdir/file_$i
  done

  cd $MNT/testdir
  /mnt/readdir_prog

  cd /mnt

  umount $MNT

This behaviour is surprising to applications and it's unlike ext4, xfs,
tmpfs, vfat and other filesystems, which always finish. In this case where
new entries were added due to renames, some file names may be reported
more than once, but this varies according to each filesystem - for example
ext4 never reported the same file more than once while xfs reports the
first 13 file names twice.

So change our readdir implementation to track the last index number when
opendir() is called and then make readdir() never process beyond that
index number. This gives the same behaviour as ext4.

Reported-by: Rob Landley <rob@landley.net>
Link: https://lore.kernel.org/linux-btrfs/2c8c55ec-04c6-e0dc-9c5c-8c7924778c35@landley.net/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217681
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-14 16:17:37 +02:00
Steve French
7b38f6ddc9 smb3: display network namespace in debug information
We recently had problems where a network namespace was deleted
causing hard to debug reconnect problems.  To help deal with
configuration issues like this it is useful to dump the network
namespace to better debug what happened.

So add this to information displayed in /proc/fs/cifs/DebugData for
the server (and channels if mounted with multichannel). For example:

   Local Users To Server: 1 SecMode: 0x1 Req On Wire: 0 Net namespace: 4026531840

This can be easily compared with what is displayed for the
processes on the system. For example /proc/1/ns/net in this case
showed the same thing (see below), and we can see that the namespace
is still valid in this example.

   'net:[4026531840]'

Cc: stable@vger.kernel.org
Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-14 08:41:29 -05:00
Russell Harmon via samba-technical
69513dd669 cifs: Release folio lock on fscache read hit.
Under the current code, when cifs_readpage_worker is called, the call
contract is that the callee should unlock the page. This is documented
in the read_folio section of Documentation/filesystems/vfs.rst as:

> The filesystem should unlock the folio once the read has completed,
> whether it was successful or not.

Without this change, when fscache is in use and cache hit occurs during
a read, the page lock is leaked, producing the following stack on
subsequent reads (via mmap) to the page:

$ cat /proc/3890/task/12864/stack
[<0>] folio_wait_bit_common+0x124/0x350
[<0>] filemap_read_folio+0xad/0xf0
[<0>] filemap_fault+0x8b1/0xab0
[<0>] __do_fault+0x39/0x150
[<0>] do_fault+0x25c/0x3e0
[<0>] __handle_mm_fault+0x6ca/0xc70
[<0>] handle_mm_fault+0xe9/0x350
[<0>] do_user_addr_fault+0x225/0x6c0
[<0>] exc_page_fault+0x84/0x1b0
[<0>] asm_exc_page_fault+0x27/0x30

This requires a reboot to resolve; it is a deadlock.

Note however that the call to cifs_readpage_from_fscache does mark the
page clean, but does not free the folio lock. This happens in
__cifs_readpage_from_fscache on success. Releasing the lock at that
point however is not appropriate as cifs_readahead also calls
cifs_readpage_from_fscache and *does* unconditionally release the lock
after its return. This change therefore effectively makes
cifs_readpage_worker work like cifs_readahead.

Signed-off-by: Russell Harmon <russ@har.mn>
Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-14 08:39:12 -05:00
Linus Torvalds
a785fd28d3 for-6.5-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmTXzuUACgkQxWXV+ddt
 WDvQVg/+PwDYtfsFBBxWboR/Ehu+nGj+PGRGH5kUumCt03760GtVMYqJzakinAoA
 TUg7N+SvC0i6STUQ1LxkqdyU+eHxk0D1qwK7HJtbqNQJ+kwaEPlilHwsptMmuuM/
 xaei+C8gLmQreL5ZH6ZsnLfV4aaFR7Ur8KSiAq28H6dKXGnh4q9yio2BspeFoc4Y
 8cD2Q8eOUxLBbGkAy9RHeMWf6OMOv2jyzdA761NZrjxUe23bDWSdRM6cRhfdJIh+
 gfwW1IVH2EVOwo+FeaIpMSf4dpnenOYOKOftTncrz7XS0VEN/wJYQXGjNbLa7u4d
 RxV2RujzRPePAUKDbLRakfXotcuKdSQuX2epLSYkQTfGQ0KRYu5YIDQgkm3r7Yky
 cF5mkyEyI8lFCiop7Bgi3MqnzoY5ZgWAkWSy9/TzjQ4yRhjiZ3fmk5JgoJ8gwUc3
 Fle4czcmKvk6ZqQAn90b0qGtW9FXzVAekZjLAH26O7+dgEn+CCAfwT9GuG7h+ATM
 9Bh+5U5PWxWmNPTYU8Sn+WR9HpVL6+1maxrax/Ftb8/FuFlQXFHxK+OnTcKx9K+y
 OGsv0r/4Zv517k1qqlHvf397Jvz7MmYLyOwkqu5xyomCGtrKIBkkEGF/9sHrZJVM
 YokgphDZL8AILrnnPwCOgt4lsph1VKS/Sgvu7XKovnZbvvh8S+M=
 =csAj
 -----END PGP SIGNATURE-----

Merge tag 'for-6.5-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "More fixes, some of them going back to older releases and there are
  fixes for hangs in stress tests regarding space caching:

   - fixes and progress tracking for hangs in free space caching, found
     by test generic/475

   - writeback fixes, write pages in integrity mode and skip writing
     pages that have been written meanwhile

   - properly clear end of extent range after an error

   - relocation fixes:
      - fix race betwen qgroup tree creation and relocation
      - detect and report invalid reloc roots"

* tag 'for-6.5-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: set cache_block_group_error if we find an error
  btrfs: reject invalid reloc tree root keys with stack dump
  btrfs: exit gracefully if reloc roots don't match
  btrfs: avoid race between qgroup tree creation and relocation
  btrfs: properly clear end of the unreserved range in cow_file_range
  btrfs: don't wait for writeback on clean pages in extent_write_cache_pages
  btrfs: don't stop integrity writeback too early
  btrfs: wait for actual caching progress during allocation
2023-08-12 13:28:55 -07:00
Mateusz Guzik
c8afaa1b0f locking: remove spin_lock_prefetch
The only remaining consumer is new_inode, where it showed up in 2001 as
commit c37fa164f793 ("v2.4.9.9 -> v2.4.9.10") in a historical repo [1]
with a changelog which does not mention it.

Since then the line got only touched up to keep compiling.

While it may have been of benefit back in the day, it is guaranteed to
at best not get in the way in the multicore setting -- as the code
performs *a lot* of work between the prefetch and actual lock acquire,
any contention means the cacheline is already invalid by the time the
routine calls spin_lock().  It adds spurious traffic, for short.

On top of it prefetch is notoriously tricky to use for single-threaded
purposes, making it questionable from the get go.

As such, remove it.

I admit upfront I did not see value in benchmarking this change, but I
can do it if that is deemed appropriate.

Removal from new_inode and of the entire thing are in the same patch as
requested by Linus, so whatever weird looks can be directed at that guy.

Link: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/fs/inode.c?id=c37fa164f793735b32aa3f53154ff1a7659e6442 [1]
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-08-12 09:18:47 -07:00
Andrea Righi
adcd459ff8 ovl: validate superblock in OVL_FS()
When CONFIG_OVERLAY_FS_DEBUG is enabled add an explicit check to make
sure that OVL_FS() is always used with a valid overlayfs superblock.
Otherwise trigger a WARN_ON_ONCE().

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12 19:02:54 +03:00
Andrea Righi
f01d08899f ovl: make consistent use of OVL_FS()
Always use OVL_FS() to retrieve the corresponding struct ovl_fs from a
struct super_block.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12 19:02:54 +03:00
Andrea Righi
3629554296 ovl: Kconfig: introduce CONFIG_OVERLAY_FS_DEBUG
Provide a Kconfig option to enable extra debugging checks for overlayfs.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12 19:02:53 +03:00
Amir Goldstein
cbb44f0935 ovl: auto generate uuid for new overlay filesystems
Add a new mount option uuid=auto, which is the default.

If a persistent UUID xattr is found it is used.

Otherwise, an existing ovelrayfs with copied up subdirs in upper dir
that was never mounted with uuid=on retains the null UUID.

A new overlayfs with no copied up subdirs, generates the persistent UUID
on first mount.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12 19:02:50 +03:00
Amir Goldstein
d9544c1b0d ovl: store persistent uuid/fsid with uuid=on
With uuid=on, store a persistent uuid in xattr on the upper dir to
give the overlayfs instance a persistent identifier.

This also makes f_fsid persistent and more reliable for reporting
fid info in fanotify events.

uuid=on is not supported on non-upper overlayfs or with upper fs
that does not support xattrs.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12 19:02:50 +03:00
Amir Goldstein
b0504bfe1b ovl: add support for unique fsid per instance
The legacy behavior of ovl_statfs() reports the f_fsid filled by
underlying upper fs. This fsid is not unique among overlayfs instances
on the same upper fs.

With mount option uuid=on, generate a non-persistent uuid per overlayfs
instance and use it as the seed for f_fsid, similar to tmpfs.

This is useful for reporting fanotify events with fid info from different
instances of overlayfs over the same upper fs.

The old behavior of null uuid and upper fs fsid is retained with the
mount option uuid=null, which is the default.

The mount option uuid=off that disables uuid checks in underlying layers
also retains the legacy behavior.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12 19:02:50 +03:00
Amir Goldstein
16aac5ad1f ovl: support encoding non-decodable file handles
When all layers support file handles, we support encoding non-decodable
file handles (a.k.a. fid) even with nfs_export=off.

When file handles do not need to be decoded, we do not need to copy up
redirected lower directories on encode, and we encode also non-indexed
upper with lower file handle, so fid will not change on copy up.

This enables reporting fanotify events with file handles on overlayfs
with default config/mount options.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12 19:02:47 +03:00
Alexander Larsson
0c71faf5a6 ovl: Handle verity during copy-up
During regular metacopy, if lowerdata file has fs-verity enabled, and
the verity option is enabled, we add the digest to the metacopy xattr.

If verity is required, and lowerdata does not have fs-verity enabled,
fall back to full copy-up (or the generated metacopy would not
validate).

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12 19:02:38 +03:00
Alexander Larsson
184996e92e ovl: Validate verity xattr when resolving lowerdata
The new digest field in the metacopy xattr is used during lookup to
record whether the header contained a digest in the OVL_HAS_DIGEST
flags.

When accessing file data the first time, if OVL_HAS_DIGEST is set, we
reload the metadata and check that the source lowerdata inode matches
the specified digest in it (according to the enabled verity
options). If the verity check passes we store this info in the inode
flags as OVL_VERIFIED_DIGEST, so that we can avoid doing it again if
the inode remains in memory.

The verification is done in ovl_maybe_validate_verity() which needs to
be called in the same places as ovl_maybe_lookup_lowerdata(), so there
is a new ovl_verify_lowerdata() helper that calls these in the right
order, and all current callers of ovl_maybe_lookup_lowerdata() are
changed to call it instead.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12 19:02:38 +03:00
Alexander Larsson
bf07089081 ovl: Add versioned header for overlay.metacopy xattr
Historically overlay.metacopy was a zero-size xattr, and it's
existence marked a metacopy file. This change adds a versioned header
with a flag field, a length and a digest. The initial use-case of this
will be for validating a fs-verity digest, but the flags field could
also be used later for other new features.

ovl_check_metacopy_xattr() now returns the size of the xattr,
emulating a size of OVL_METACOPY_MIN_SIZE for empty xattrs to
distinguish it from the no-xattr case.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12 19:02:38 +03:00
Alexander Larsson
ae8cba4033 ovl: Add framework for verity support
This adds the scaffolding (docs, config, mount options) for supporting
the new digest field in the metacopy xattr. This contains a fs-verity
digest that need to match the fs-verity digest of the lowerdata
file. The mount option "verity" specifies how this xattr is handled.

If you enable verity ("verity=on") all existing xattrs are validated
before use, and during metacopy we generate verity xattr in the upper
metacopy file (if the source file has verity enabled). This means
later accesses can guarantee that the same data is used.

Additionally you can use "verity=require". In this mode all metacopy
files must have a valid verity xattr. For this to work metadata
copy-up must be able to create a verity xattr (so that later accesses
are validated). Therefore, in this mode, if the lower data file
doesn't have fs-verity enabled we fall back to a full copy rather than
a metacopy.

Actual implementation follows in a separate commit.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12 19:02:38 +03:00