587 Commits

Author SHA1 Message Date
Jaegeuk Kim
d420c4a06d f2fs: enforce single zone capacity
[ Upstream commit b771aadc6e4c221a468fe4a2dfcfffec01a06722 ]

In order to simplify the complicated per-zone capacity, let's support
only one capacity for entire zoned device.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable-dep-of: 0b37ed21e336 ("f2fs: apply zone capacity to all zone type")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-17 11:47:49 +02:00
Eric Biggers
33909b1a64 f2fs: use memcpy_{to,from}_page() where possible
[ Upstream commit b87846bd61c7c09560617da416208a5454530d57 ]

This is simpler, and as a side effect it replaces several uses of
kmap_atomic() with its recommended replacement kmap_local_page().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable-dep-of: b1b9896718bc ("fs: f2fs: initialize fsdata in pagecache_write()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-11 16:40:09 +01:00
Jaegeuk Kim
12014eaf1b f2fs: increase the limit for reserve_root
commit da35fe96d12d15779f3cb74929b7ed03941cf983 upstream.

This patch increases the threshold that limits the reserved root space from 0.2%
to 12.5% by using simple shift operation.

Typically Android sets 128MB, but if the storage capacity is 32GB, 0.2% which is
around 64MB becomes too small. Let's relax it.

Cc: stable@vger.kernel.org
Reported-by: Aran Dalton <arda@allwinnertech.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26 13:25:12 +02:00
Jaegeuk Kim
196f72e089 f2fs: don't need inode lock for system hidden quota
commit 6213f5d4d23c50d393a31dc8e351e63a1fd10dbe upstream.

Let's avoid false-alarmed lockdep warning.

[   58.914674] [T1501146] -> #2 (&sb->s_type->i_mutex_key#20){+.+.}-{3:3}:
[   58.915975] [T1501146] system_server:        down_write+0x7c/0xe0
[   58.916738] [T1501146] system_server:        f2fs_quota_sync+0x60/0x1a8
[   58.917563] [T1501146] system_server:        block_operations+0x16c/0x43c
[   58.918410] [T1501146] system_server:        f2fs_write_checkpoint+0x114/0x318
[   58.919312] [T1501146] system_server:        f2fs_issue_checkpoint+0x178/0x21c
[   58.920214] [T1501146] system_server:        f2fs_sync_fs+0x48/0x6c
[   58.920999] [T1501146] system_server:        f2fs_do_sync_file+0x334/0x738
[   58.921862] [T1501146] system_server:        f2fs_sync_file+0x30/0x48
[   58.922667] [T1501146] system_server:        __arm64_sys_fsync+0x84/0xf8
[   58.923506] [T1501146] system_server:        el0_svc_common.llvm.12821150825140585682+0xd8/0x20c
[   58.924604] [T1501146] system_server:        do_el0_svc+0x28/0xa0
[   58.925366] [T1501146] system_server:        el0_svc+0x24/0x38
[   58.926094] [T1501146] system_server:        el0_sync_handler+0x88/0xec
[   58.926920] [T1501146] system_server:        el0_sync+0x1b4/0x1c0

[   58.927681] [T1501146] -> #1 (&sbi->cp_global_sem){+.+.}-{3:3}:
[   58.928889] [T1501146] system_server:        down_write+0x7c/0xe0
[   58.929650] [T1501146] system_server:        f2fs_write_checkpoint+0xbc/0x318
[   58.930541] [T1501146] system_server:        f2fs_issue_checkpoint+0x178/0x21c
[   58.931443] [T1501146] system_server:        f2fs_sync_fs+0x48/0x6c
[   58.932226] [T1501146] system_server:        sync_filesystem+0xac/0x130
[   58.933053] [T1501146] system_server:        generic_shutdown_super+0x38/0x150
[   58.933958] [T1501146] system_server:        kill_block_super+0x24/0x58
[   58.934791] [T1501146] system_server:        kill_f2fs_super+0xcc/0x124
[   58.935618] [T1501146] system_server:        deactivate_locked_super+0x90/0x120
[   58.936529] [T1501146] system_server:        deactivate_super+0x74/0xac
[   58.937356] [T1501146] system_server:        cleanup_mnt+0x128/0x168
[   58.938150] [T1501146] system_server:        __cleanup_mnt+0x18/0x28
[   58.938944] [T1501146] system_server:        task_work_run+0xb8/0x14c
[   58.939749] [T1501146] system_server:        do_notify_resume+0x114/0x1e8
[   58.940595] [T1501146] system_server:        work_pending+0xc/0x5f0

[   58.941375] [T1501146] -> #0 (&sbi->gc_lock){+.+.}-{3:3}:
[   58.942519] [T1501146] system_server:        __lock_acquire+0x1270/0x2868
[   58.943366] [T1501146] system_server:        lock_acquire+0x114/0x294
[   58.944169] [T1501146] system_server:        down_write+0x7c/0xe0
[   58.944930] [T1501146] system_server:        f2fs_issue_checkpoint+0x13c/0x21c
[   58.945831] [T1501146] system_server:        f2fs_sync_fs+0x48/0x6c
[   58.946614] [T1501146] system_server:        f2fs_do_sync_file+0x334/0x738
[   58.947472] [T1501146] system_server:        f2fs_ioc_commit_atomic_write+0xc8/0x14c
[   58.948439] [T1501146] system_server:        __f2fs_ioctl+0x674/0x154c
[   58.949253] [T1501146] system_server:        f2fs_ioctl+0x54/0x88
[   58.950018] [T1501146] system_server:        __arm64_sys_ioctl+0xa8/0x110
[   58.950865] [T1501146] system_server:        el0_svc_common.llvm.12821150825140585682+0xd8/0x20c
[   58.951965] [T1501146] system_server:        do_el0_svc+0x28/0xa0
[   58.952727] [T1501146] system_server:        el0_svc+0x24/0x38
[   58.953454] [T1501146] system_server:        el0_sync_handler+0x88/0xec
[   58.954279] [T1501146] system_server:        el0_sync+0x1b4/0x1c0

Cc: stable@vger.kernel.org
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09 10:21:21 +02:00
Juhyung Park
e58ee6bd93 f2fs: quota: fix loop condition at f2fs_quota_sync()
commit 680af5b824a52faa819167628665804a14f0e0df upstream.

cnt should be passed to sb_has_quota_active() instead of type to check
active quota properly.

Moreover, when the type is -1, the compiler with enough inline knowledge
can discard sb_has_quota_active() check altogether, causing a NULL pointer
dereference at the following inode_lock(dqopt->files[cnt]):

[    2.796010] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000a0
[    2.796024] Mem abort info:
[    2.796025]   ESR = 0x96000005
[    2.796028]   EC = 0x25: DABT (current EL), IL = 32 bits
[    2.796029]   SET = 0, FnV = 0
[    2.796031]   EA = 0, S1PTW = 0
[    2.796032] Data abort info:
[    2.796034]   ISV = 0, ISS = 0x00000005
[    2.796035]   CM = 0, WnR = 0
[    2.796046] user pgtable: 4k pages, 39-bit VAs, pgdp=00000003370d1000
[    2.796048] [00000000000000a0] pgd=0000000000000000, pud=0000000000000000
[    2.796051] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[    2.796056] CPU: 7 PID: 640 Comm: f2fs_ckpt-259:7 Tainted: G S                5.4.179-arter97-r8-64666-g2f16e087f9d8 #1
[    2.796057] Hardware name: Qualcomm Technologies, Inc. Lahaina MTP lemonadep (DT)
[    2.796059] pstate: 80c00005 (Nzcv daif +PAN +UAO)
[    2.796065] pc : down_write+0x28/0x70
[    2.796070] lr : f2fs_quota_sync+0x100/0x294
[    2.796071] sp : ffffffa3f48ffc30
[    2.796073] x29: ffffffa3f48ffc30 x28: 0000000000000000
[    2.796075] x27: ffffffa3f6d718b8 x26: ffffffa415fe9d80
[    2.796077] x25: ffffffa3f7290048 x24: 0000000000000001
[    2.796078] x23: 0000000000000000 x22: ffffffa3f7290000
[    2.796080] x21: ffffffa3f72904a0 x20: ffffffa3f7290110
[    2.796081] x19: ffffffa3f77a9800 x18: ffffffc020aae038
[    2.796083] x17: ffffffa40e38e040 x16: ffffffa40e38e6d0
[    2.796085] x15: ffffffa40e38e6cc x14: ffffffa40e38e6d0
[    2.796086] x13: 00000000000004f6 x12: 00162c44ff493000
[    2.796088] x11: 0000000000000400 x10: ffffffa40e38c948
[    2.796090] x9 : 0000000000000000 x8 : 00000000000000a0
[    2.796091] x7 : 0000000000000000 x6 : 0000d1060f00002a
[    2.796093] x5 : ffffffa3f48ff718 x4 : 000000000000000d
[    2.796094] x3 : 00000000060c0000 x2 : 0000000000000001
[    2.796096] x1 : 0000000000000000 x0 : 00000000000000a0
[    2.796098] Call trace:
[    2.796100]  down_write+0x28/0x70
[    2.796102]  f2fs_quota_sync+0x100/0x294
[    2.796104]  block_operations+0x120/0x204
[    2.796106]  f2fs_write_checkpoint+0x11c/0x520
[    2.796107]  __checkpoint_and_complete_reqs+0x7c/0xd34
[    2.796109]  issue_checkpoint_thread+0x6c/0xb8
[    2.796112]  kthread+0x138/0x414
[    2.796114]  ret_from_fork+0x10/0x18
[    2.796117] Code: aa0803e0 aa1f03e1 52800022 aa0103e9 (c8e97d02)
[    2.796120] ---[ end trace 96e942e8eb6a0b53 ]---
[    2.800116] Kernel panic - not syncing: Fatal exception
[    2.800120] SMP: stopping secondary CPUs

Fixes: 9de71ede81e6 ("f2fs: quota: fix potential deadlock")
Cc: <stable@vger.kernel.org> # v5.15+
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08 14:39:51 +02:00
Chao Yu
a49e402f23 f2fs: fix to reserve space for IO align feature
commit 300a842937fbcfb5a189cea9ba15374fdb0b5c6b upstream.

https://bugzilla.kernel.org/show_bug.cgi?id=204137

With below script, we will hit panic during new segment allocation:

DISK=bingo.img
MOUNT_DIR=/mnt/f2fs

dd if=/dev/zero of=$DISK bs=1M count=105
mkfs.f2fe -a 1 -o 19 -t 1 -z 1 -f -q $DISK

mount -t f2fs $DISK $MOUNT_DIR -o "noinline_dentry,flush_merge,noextent_cache,mode=lfs,io_bits=7,fsync_mode=strict"

for (( i = 0; i < 4096; i++ )); do
	name=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 10`
	mkdir $MOUNT_DIR/$name
done

umount $MOUNT_DIR
rm $DISK
2022-01-27 10:54:31 +01:00
Chao Yu
7a5439474e f2fs: fix incorrect return value in f2fs_sanity_check_ckpt()
[ Upstream commit ca98d72141dd81f42893a9a43d7ededab3355fba ]

As Pavel Machek reported in [1]

This code looks quite confused: part of function returns 1 on
corruption, part returns -errno. The problem is not stable-specific.

[1] https://lkml.org/lkml/2021/9/19/207

Let's fix to make 'insane cp_payload case' to return 1 rater than
EFSCORRUPTED, so that return value can be kept consistent for all
error cases, it can avoid confusion of code logic.

Fixes: 65ddf6564843 ("f2fs: fix to do sanity check for sb/cp fields correctly")
Reported-by: Pavel Machek <pavel@denx.de>
Reviewed-by: Pavel Machek <pavel@denx.de>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-26 10:39:12 +01:00
Keoseong Park
3d7c5d08a4 f2fs: fix to use WHINT_MODE
[ Upstream commit 011e0868e0cf1237675b22e36fffa958fb08f46e ]

Since active_logs can be set to 2 or 4 or NR_CURSEG_PERSIST_TYPE(6),
it cannot be set to NR_CURSEG_TYPE(8).
That is, whint_mode is always off.

Therefore, the condition is changed from NR_CURSEG_TYPE to NR_CURSEG_PERSIST_TYPE.

Cc: Chao Yu <chao@kernel.org>
Fixes: d0b9e42ab615 (f2fs: introduce inmem curseg)
Reported-by: tanghuan <tanghuan@vivo.com>
Signed-off-by: Keoseong Park <keosung.park@samsung.com>
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-26 10:39:12 +01:00
Chao Yu
2d586a3f5b f2fs: quota: fix potential deadlock
[ Upstream commit 9de71ede81e6d1a111fdd868b2d78d459fa77f80 ]

xfstest generic/587 reports a deadlock issue as below:

======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc1 #69 Not tainted
------------------------------------------------------
repquota/8606 is trying to acquire lock:
ffff888022ac9320 (&sb->s_type->i_mutex_key#18){+.+.}-{3:3}, at: f2fs_quota_sync+0x207/0x300 [f2fs]

but task is already holding lock:
ffff8880084bcde8 (&sbi->quota_sem){.+.+}-{3:3}, at: f2fs_quota_sync+0x59/0x300 [f2fs]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&sbi->quota_sem){.+.+}-{3:3}:
       __lock_acquire+0x648/0x10b0
       lock_acquire+0x128/0x470
       down_read+0x3b/0x2a0
       f2fs_quota_sync+0x59/0x300 [f2fs]
       f2fs_quota_on+0x48/0x100 [f2fs]
       do_quotactl+0x5e3/0xb30
       __x64_sys_quotactl+0x23a/0x4e0
       do_syscall_64+0x3b/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #1 (&sbi->cp_rwsem){++++}-{3:3}:
       __lock_acquire+0x648/0x10b0
       lock_acquire+0x128/0x470
       down_read+0x3b/0x2a0
       f2fs_unlink+0x353/0x670 [f2fs]
       vfs_unlink+0x1c7/0x380
       do_unlinkat+0x413/0x4b0
       __x64_sys_unlinkat+0x50/0xb0
       do_syscall_64+0x3b/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #0 (&sb->s_type->i_mutex_key#18){+.+.}-{3:3}:
       check_prev_add+0xdc/0xb30
       validate_chain+0xa67/0xb20
       __lock_acquire+0x648/0x10b0
       lock_acquire+0x128/0x470
       down_write+0x39/0xc0
       f2fs_quota_sync+0x207/0x300 [f2fs]
       do_quotactl+0xaff/0xb30
       __x64_sys_quotactl+0x23a/0x4e0
       do_syscall_64+0x3b/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Chain exists of:
  &sb->s_type->i_mutex_key#18 --> &sbi->cp_rwsem --> &sbi->quota_sem

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&sbi->quota_sem);
                               lock(&sbi->cp_rwsem);
                               lock(&sbi->quota_sem);
  lock(&sb->s_type->i_mutex_key#18);

 *** DEADLOCK ***

3 locks held by repquota/8606:
 #0: ffff88801efac0e0 (&type->s_umount_key#53){++++}-{3:3}, at: user_get_super+0xd9/0x190
 #1: ffff8880084bc380 (&sbi->cp_rwsem){++++}-{3:3}, at: f2fs_quota_sync+0x3e/0x300 [f2fs]
 #2: ffff8880084bcde8 (&sbi->quota_sem){.+.+}-{3:3}, at: f2fs_quota_sync+0x59/0x300 [f2fs]

stack backtrace:
CPU: 6 PID: 8606 Comm: repquota Not tainted 5.14.0-rc1 #69
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Call Trace:
 dump_stack_lvl+0xce/0x134
 dump_stack+0x17/0x20
 print_circular_bug.isra.0.cold+0x239/0x253
 check_noncircular+0x1be/0x1f0
 check_prev_add+0xdc/0xb30
 validate_chain+0xa67/0xb20
 __lock_acquire+0x648/0x10b0
 lock_acquire+0x128/0x470
 down_write+0x39/0xc0
 f2fs_quota_sync+0x207/0x300 [f2fs]
 do_quotactl+0xaff/0xb30
 __x64_sys_quotactl+0x23a/0x4e0
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f883b0b4efe

The root cause is ABBA deadlock of inode lock and cp_rwsem,
reorder locks in f2fs_quota_sync() as below to fix this issue:
- lock inode
- lock cp_rwsem
- lock quota_sem

Fixes: db6ec53b7e03 ("f2fs: add a rw_sem to cover quota flag changes")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-18 13:40:11 +02:00
Chao Yu
74d6dfcb0f f2fs: fix to do sanity check for sb/cp fields correctly
commit 65ddf6564843890a58ee3b18bb46ce67d96333fb upstream.

This patch fixes below problems of sb/cp sanity check:
- in sanity_check_raw_superi(), it missed to consider log header
blocks while cp_payload check.
- in f2fs_sanity_check_ckpt(), it missed to check nat_bits_blocks.

Cc: <stable@kernel.org>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-18 13:40:10 +02:00
Jaegeuk Kim
611b7f9dc9 f2fs: guarantee to write dirty data when enabling checkpoint back
commit dddd3d65293a52c2c3850c19b1e5115712e534d8 upstream.

We must flush all the dirty data when enabling checkpoint back. Let's guarantee
that first by adding a retry logic on sync_inodes_sb(). In addition to that,
this patch adds to flush data in fsync when checkpoint is disabled, which can
mitigate the sync_inodes_sb() failures in advance.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-15 09:50:46 +02:00
Chao Yu
43cefd1264 f2fs: add MODULE_SOFTDEP to ensure crc32 is included in the initramfs
[ Upstream commit 0dd571785d61528d62cdd8aa49d76bc6085152fe ]

As marcosfrm reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=213089

Initramfs generators rely on "pre" softdeps (and "depends") to include
additional required modules.

F2FS does not declare "pre: crc32" softdep. Then every generator (dracut,
mkinitcpio...) has to maintain a hardcoded list for this purpose.

Hence let's use MODULE_SOFTDEP("pre: crc32") in f2fs code.

Fixes: 43b6573bac95 ("f2fs: use cryptoapi crc32 functions")
Reported-by: marcosfrm <marcosfrm@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20 16:05:50 +02:00
Chao Yu
8678660301 f2fs: fix to avoid racing on fsync_entry_slab by multi filesystem instances
commit cad83c968c2ebe97905f900326988ed37146c347 upstream.

As syzbot reported, there is an use-after-free issue during f2fs recovery:

Use-after-free write at 0xffff88823bc16040 (in kfence-#10):
 kmem_cache_destroy+0x1f/0x120 mm/slab_common.c:486
 f2fs_recover_fsync_data+0x75b0/0x8380 fs/f2fs/recovery.c:869
 f2fs_fill_super+0x9393/0xa420 fs/f2fs/super.c:3945
 mount_bdev+0x26c/0x3a0 fs/super.c:1367
 legacy_get_tree+0xea/0x180 fs/fs_context.c:592
 vfs_get_tree+0x86/0x270 fs/super.c:1497
 do_new_mount fs/namespace.c:2905 [inline]
 path_mount+0x196f/0x2be0 fs/namespace.c:3235
 do_mount fs/namespace.c:3248 [inline]
 __do_sys_mount fs/namespace.c:3456 [inline]
 __se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3433
 do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
 entry_SYSCALL_64_after_hwframe+0x44/0xae

The root cause is multi f2fs filesystem instances can race on accessing
global fsync_entry_slab pointer, result in use-after-free issue of slab
cache, fixes to init/destroy this slab cache only once during module
init/destroy procedure to avoid this issue.

Reported-by: syzbot+9d90dad32dd9727ed084@syzkaller.appspotmail.com
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-19 09:45:03 +02:00
Chao Yu
81ba1634d1 f2fs: fix to allow migrating fully valid segment
[ Upstream commit 7dede88659df38f96128ab3922c50dde2d29c574 ]

F2FS_IOC_FLUSH_DEVICE/F2FS_IOC_RESIZE_FS needs to migrate all blocks of
target segment to other place, no matter the segment has partially or fully
valid blocks.

However, after commit 803e74be04b3 ("f2fs: stop GC when the victim becomes
fully valid"), we may skip migration due to target segment is fully valid,
result in failing the ioctl interface, fix this.

Fixes: 803e74be04b3 ("f2fs: stop GC when the victim becomes fully valid")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-19 10:13:00 +02:00
Jaegeuk Kim
f98be16898 f2fs: flush data when enabling checkpoint back
commit b0ff4fe746fd028eef920ddc8c7b0361c1ede6ec upstream.

During checkpoint=disable period, f2fs bypasses all the synchronous IOs such as
sync and fsync. So, when enabling it back, we must flush all of them in order
to keep the data persistent. Otherwise, suddern power-cut right after enabling
checkpoint will cause data loss.

Fixes: 4354994f097d ("f2fs: checkpoint disabling")
Cc: stable@vger.kernel.org
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:43 +01:00
Chao Yu
1c5a034710 f2fs: fix shift-out-of-bounds in sanity_check_raw_super()
commit e584bbe821229a3e7cc409eecd51df66f9268c21 upstream.

syzbot reported a bug which could cause shift-out-of-bounds issue,
fix it.

Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x107/0x163 lib/dump_stack.c:120
 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
 sanity_check_raw_super fs/f2fs/super.c:2812 [inline]
 read_raw_super_block fs/f2fs/super.c:3267 [inline]
 f2fs_fill_super.cold+0x16c9/0x16f6 fs/f2fs/super.c:3519
 mount_bdev+0x34d/0x410 fs/super.c:1366
 legacy_get_tree+0x105/0x220 fs/fs_context.c:592
 vfs_get_tree+0x89/0x2f0 fs/super.c:1496
 do_new_mount fs/namespace.c:2896 [inline]
 path_mount+0x12ae/0x1e70 fs/namespace.c:3227
 do_mount fs/namespace.c:3240 [inline]
 __do_sys_mount fs/namespace.c:3448 [inline]
 __se_sys_mount fs/namespace.c:3425 [inline]
 __x64_sys_mount+0x27f/0x300 fs/namespace.c:3425
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported-by: syzbot+ca9a785f8ac472085994@syzkaller.appspotmail.com
Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-06 14:56:52 +01:00
Hyeongseok Kim
9c14fb58a1 f2fs: fix double free of unicode map
[ Upstream commit 89ff6005039a878afac87889fee748fa3f957c3a ]

In case of retrying fill_super with skip_recovery,
s_encoding for casefold would not be loaded again even though it's
already been freed because it's not NULL.
Set NULL after free to prevent double freeing when unmount.

Fixes: eca4873ee1b6 ("f2fs: Use generic casefolding support")
Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 11:53:31 +01:00
Linus Torvalds
0eac1102e9 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff all over the place (the largest group here is
  Christoph's stat cleanups)"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: remove KSTAT_QUERY_FLAGS
  fs: remove vfs_stat_set_lookup_flags
  fs: move vfs_fstatat out of line
  fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat
  fs: remove vfs_statx_fd
  fs: omfs: use kmemdup() rather than kmalloc+memcpy
  [PATCH] reduce boilerplate in fsid handling
  fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS
  selftests: mount: add nosymfollow tests
  Add a "nosymfollow" mount option.
2020-10-24 12:26:05 -07:00
Linus Torvalds
7a3dadedc8 f2fs-for-5.10-rc1
In this round, we've added new features such as zone capacity for ZNS and
 a new GC policy, ATGC, along with in-memory segment management. In addition,
 we could improve the decompression speed significantly by changing virtual
 mapping method. Even though we've fixed lots of small bugs in compression
 support, I feel that it becomes more stable so that I could give it a try in
 production.
 
 Enhancement:
  - suport zone capacity in NVMe Zoned Namespace devices
  - introduce in-memory current segment management
  - add standart casefolding support
  - support age threshold based garbage collection
  - improve decompression speed by changing virtual mapping method
 
 Bug fix:
  - fix condition checks in some ioctl() such as compression, move_range, etc
  - fix 32/64bits support in data structures
  - fix memory allocation in zstd decompress
  - add some boundary checks to avoid kernel panic on corrupted image
  - fix disallowing compression for non-empty file
  - fix slab leakage of compressed block writes
 
 In addition, it includes code refactoring for better readability and minor
 bug fixes for compression and zoned device support.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl+I0nQACgkQQBSofoJI
 UNIJdw//Rj0YapYXSu1nlOQppzSAkCiL6wxrrG2qkisRE7uXSYGZfWBqrMT6Asnf
 i0f25i6ywZOZ00dgN/klZRBh4YYSgJqYx9BPkTxZsQZ/S/EmZJPpr8m4VUB69LKL
 VijwUdgcW9vNDJ2/DkDQDVBd/ZqRxXnltffWtP4pS96Gj089/dE2q8KXqQrt3LM1
 lLQjDfHj+0AyWRzKpErTO0W9DOgO7wmmelS0h6m2RYttkbb328JEZezg5bjWNNlk
 eTXxuAFFy8Ap9DngkC/sqvY2NRTv1YgOPfrT8XWwdDIiFTZ+LoYdFI5Ap/UW7QwG
 iz7B/0wPj3+9ncl536LRbFPiLisbYrArYGmZKF6t8w1cP6mTVlileMT4Q6s9+qhn
 GJS88tTBVlR9vbzDu2brjI6qRQVTBdsohIGoA1g6lz0ogbphhmTzujPbFQ6GTSBi
 3sKKp59urkBpVH3TVJU1oshLjIEG2yToMgYwZH9DU7zlzJS6XpetJrzReqItEThc
 VNixg2DxdIFQ+nrMt+LtWaOHs5qzxIIPksguGEhqSkLL5lI75n2MZxrhKXmUsaZa
 qItJE0ndJFfi6vggIkJID+a0bpTss7+AxF1AmSZDafMLkZy8j14DNQAnmBUYRX0J
 5QExZ+LyyaKjWQ/k9SsSzV1Y3dbguDyB+gkeMhr/6XEr9DSwZc8=
 =exgv
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've added new features such as zone capacity for ZNS
  and a new GC policy, ATGC, along with in-memory segment management. In
  addition, we could improve the decompression speed significantly by
  changing virtual mapping method. Even though we've fixed lots of small
  bugs in compression support, I feel that it becomes more stable so
  that I could give it a try in production.

  Enhancements:
   - suport zone capacity in NVMe Zoned Namespace devices
   - introduce in-memory current segment management
   - add standart casefolding support
   - support age threshold based garbage collection
   - improve decompression speed by changing virtual mapping method

  Bug fixes:
   - fix condition checks in some ioctl() such as compression, move_range, etc
   - fix 32/64bits support in data structures
   - fix memory allocation in zstd decompress
   - add some boundary checks to avoid kernel panic on corrupted image
   - fix disallowing compression for non-empty file
   - fix slab leakage of compressed block writes

  In addition, it includes code refactoring for better readability and
  minor bug fixes for compression and zoned device support"

* tag 'f2fs-for-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (51 commits)
  f2fs: code cleanup by removing unnecessary check
  f2fs: wait for sysfs kobject removal before freeing f2fs_sb_info
  f2fs: fix writecount false positive in releasing compress blocks
  f2fs: introduce check_swap_activate_fast()
  f2fs: don't issue flush in f2fs_flush_device_cache() for nobarrier case
  f2fs: handle errors of f2fs_get_meta_page_nofail
  f2fs: fix to set SBI_NEED_FSCK flag for inconsistent inode
  f2fs: reject CASEFOLD inode flag without casefold feature
  f2fs: fix memory alignment to support 32bit
  f2fs: fix slab leak of rpages pointer
  f2fs: compress: fix to disallow enabling compress on non-empty file
  f2fs: compress: introduce cic/dic slab cache
  f2fs: compress: introduce page array slab cache
  f2fs: fix to do sanity check on segment/section count
  f2fs: fix to check segment boundary during SIT page readahead
  f2fs: fix uninit-value in f2fs_lookup
  f2fs: remove unneeded parameter in find_in_block()
  f2fs: fix wrong total_sections check and fsmeta check
  f2fs: remove duplicated code in sanity_check_area_boundary
  f2fs: remove unused check on version_bitmap
  ...
2020-10-16 15:14:43 -07:00
Chao Yu
c68d6c8830 f2fs: compress: introduce cic/dic slab cache
Add two slab caches: "f2fs_cic_entry" and "f2fs_dic_entry" for memory
allocation of compress_io_ctx and decompress_io_ctx structure.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 09:16:36 -07:00
Chao Yu
3108303170 f2fs: compress: introduce page array slab cache
Add a per-sbi slab cache "f2fs_page_array_entry-%u:%u" for memory
allocation of page pointer array in compress context.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: Fix wrong memory allocation]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 09:16:32 -07:00
Chao Yu
3a22e9ac71 f2fs: fix to do sanity check on segment/section count
As syzbot reported:

BUG: KASAN: slab-out-of-bounds in init_min_max_mtime fs/f2fs/segment.c:4710 [inline]
BUG: KASAN: slab-out-of-bounds in f2fs_build_segment_manager+0x9302/0xa6d0 fs/f2fs/segment.c:4792
Read of size 8 at addr ffff8880a1b934a8 by task syz-executor682/6878

CPU: 1 PID: 6878 Comm: syz-executor682 Not tainted 5.9.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x198/0x1fd lib/dump_stack.c:118
 print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
 __kasan_report mm/kasan/report.c:513 [inline]
 kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
 init_min_max_mtime fs/f2fs/segment.c:4710 [inline]
 f2fs_build_segment_manager+0x9302/0xa6d0 fs/f2fs/segment.c:4792
 f2fs_fill_super+0x381a/0x6e80 fs/f2fs/super.c:3633
 mount_bdev+0x32e/0x3f0 fs/super.c:1417
 legacy_get_tree+0x105/0x220 fs/fs_context.c:592
 vfs_get_tree+0x89/0x2f0 fs/super.c:1547
 do_new_mount fs/namespace.c:2875 [inline]
 path_mount+0x1387/0x20a0 fs/namespace.c:3192
 do_mount fs/namespace.c:3205 [inline]
 __do_sys_mount fs/namespace.c:3413 [inline]
 __se_sys_mount fs/namespace.c:3390 [inline]
 __x64_sys_mount+0x27f/0x300 fs/namespace.c:3390
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

The root cause is: if segs_per_sec is larger than one, and segment count
in last section is less than segs_per_sec, we will suffer out-of-boundary
memory access on sit_i->sentries[] in init_min_max_mtime().

Fix this by adding sanity check among segment count, section count and
segs_per_sec value in sanity_check_raw_super().

Reported-by: syzbot+481a3ffab50fed41dcc0@syzkaller.appspotmail.com
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 02:12:41 -07:00
Wang Xiaojun
f99ba9add6 f2fs: fix wrong total_sections check and fsmeta check
Meta area is not included in section_count computation.
So the minimum number of total_sections is 1 meanwhile it cannot be
greater than segment_count_main.

The minimum number of meta segments is 8 (SB + 2 (CP + SIT + NAT) + SSA).

Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 01:48:34 -07:00
Wang Xiaojun
d89f589130 f2fs: remove duplicated code in sanity_check_area_boundary
Use seg_end_blkaddr instead of "segment0_blkaddr + (segment_count <<
log_blocks_per_seg)".

Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 01:48:34 -07:00
Chao Yu
d0660122dc f2fs: relocate blkzoned feature check
Relocate blkzoned feature check into parse_options() like
other feature check.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 01:48:33 -07:00
Chao Yu
07eb1d6994 f2fs: do sanity check on zoned block device path
sbi->devs would be initialized only if image enables multiple device
feature or blkzoned feature, if blkzoned feature flag was set by fuzz
in non-blkzoned device, we will suffer below panic:

get_zone_idx fs/f2fs/segment.c:4892 [inline]
f2fs_usable_zone_blks_in_seg fs/f2fs/segment.c:4943 [inline]
f2fs_usable_blks_in_seg+0x39b/0xa00 fs/f2fs/segment.c:4999
Call Trace:
 check_block_count+0x69/0x4e0 fs/f2fs/segment.h:704
 build_sit_entries fs/f2fs/segment.c:4403 [inline]
 f2fs_build_segment_manager+0x51da/0xa370 fs/f2fs/segment.c:5100
 f2fs_fill_super+0x3880/0x6ff0 fs/f2fs/super.c:3684
 mount_bdev+0x32e/0x3f0 fs/super.c:1417
 legacy_get_tree+0x105/0x220 fs/fs_context.c:592
 vfs_get_tree+0x89/0x2f0 fs/super.c:1547
 do_new_mount fs/namespace.c:2896 [inline]
 path_mount+0x12ae/0x1e70 fs/namespace.c:3216
 do_mount fs/namespace.c:3229 [inline]
 __do_sys_mount fs/namespace.c:3437 [inline]
 __se_sys_mount fs/namespace.c:3414 [inline]
 __x64_sys_mount+0x27f/0x300 fs/namespace.c:3414
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46

Add sanity check to inconsistency on factors: blkzoned flag, device
path and device character to avoid above panic.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 01:48:33 -07:00
Eric Biggers
c8c868abc9 fscrypt: make fscrypt_set_test_dummy_encryption() take a 'const char *'
fscrypt_set_test_dummy_encryption() requires that the optional argument
to the test_dummy_encryption mount option be specified as a substring_t.
That doesn't work well with filesystems that use the new mount API,
since the new way of parsing mount options doesn't use substring_t.

Make it take the argument as a 'const char *' instead.

Instead of moving the match_strdup() into the callers in ext4 and f2fs,
make them just use arg->from directly.  Since the pattern is
"test_dummy_encryption=%s", the argument will be null-terminated.

Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-14-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-22 06:48:52 -07:00
Eric Biggers
ac4acb1f4b fscrypt: handle test_dummy_encryption in more logical way
The behavior of the test_dummy_encryption mount option is that when a
new file (or directory or symlink) is created in an unencrypted
directory, it's automatically encrypted using a dummy encryption policy.
That's it; in particular, the encryption (or lack thereof) of existing
files (or directories or symlinks) doesn't change.

Unfortunately the implementation of test_dummy_encryption is a bit weird
and confusing.  When test_dummy_encryption is enabled and a file is
being created in an unencrypted directory, we set up an encryption key
(->i_crypt_info) for the directory.  This isn't actually used to do any
encryption, however, since the directory is still unencrypted!  Instead,
->i_crypt_info is only used for inheriting the encryption policy.

One consequence of this is that the filesystem ends up providing a
"dummy context" (policy + nonce) instead of a "dummy policy".  In
commit ed318a6cc0b6 ("fscrypt: support test_dummy_encryption=v2"), I
mistakenly thought this was required.  However, actually the nonce only
ends up being used to derive a key that is never used.

Another consequence of this implementation is that it allows for
'inode->i_crypt_info != NULL && !IS_ENCRYPTED(inode)', which is an edge
case that can be forgotten about.  For example, currently
FS_IOC_GET_ENCRYPTION_POLICY on an unencrypted directory may return the
dummy encryption policy when the filesystem is mounted with
test_dummy_encryption.  That seems like the wrong thing to do, since
again, the directory itself is not actually encrypted.

Therefore, switch to a more logical and maintainable implementation
where the dummy encryption policy inheritance is done without setting up
keys for unencrypted directories.  This involves:

- Adding a function fscrypt_policy_to_inherit() which returns the
  encryption policy to inherit from a directory.  This can be a real
  policy, a dummy policy, or no policy.

- Replacing struct fscrypt_dummy_context, ->get_dummy_context(), etc.
  with struct fscrypt_dummy_policy, ->get_dummy_policy(), etc.

- Making fscrypt_fname_encrypted_size() take an fscrypt_policy instead
  of an inode.

Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-13-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-22 06:48:49 -07:00
Al Viro
6d1349c769 [PATCH] reduce boilerplate in fsid handling
Get rid of boilerplate in most of ->statfs()
instances...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-18 16:45:50 -04:00
Daeho Jeong
c2759ebaf7 f2fs: change i_compr_blocks of inode to atomic value
writepages() can be concurrently invoked for the same file by different
threads such as a thread fsyncing the file and a kworker kernel thread.
So, changing i_compr_blocks without protection is racy and we need to
protect it by changing it with atomic type value. Plus, we don't need
a 64bit value for i_compr_blocks, so just we will use a atomic value,
not atomic64.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11 11:11:26 -07:00
Chao Yu
69c0dd29f7 f2fs: ignore compress mount option on image w/o compression feature
to keep consistent with behavior when passing compress mount option
to kernel w/o compression feature, so that mount may not fail on
such condition.

Reported-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11 11:11:25 -07:00
Chao Yu
093749e296 f2fs: support age threshold based garbage collection
There are several issues in current background GC algorithm:
- valid blocks is one of key factors during cost overhead calculation,
so if segment has less valid block, however even its age is young or
it locates hot segment, CB algorithm will still choose the segment as
victim, it's not appropriate.
- GCed data/node will go to existing logs, no matter in-there datas'
update frequency is the same or not, it may mix hot and cold data
again.
- GC alloctor mainly use LFS type segment, it will cost free segment
more quickly.

This patch introduces a new algorithm named age threshold based
garbage collection to solve above issues, there are three steps
mainly:

1. select a source victim:
- set an age threshold, and select candidates beased threshold:
e.g.
 0 means youngest, 100 means oldest, if we set age threshold to 80
 then select dirty segments which has age in range of [80, 100] as
 candiddates;
- set candidate_ratio threshold, and select candidates based the
ratio, so that we can shrink candidates to those oldest segments;
- select target segment with fewest valid blocks in order to
migrate blocks with minimum cost;

2. select a target victim:
- select candidates beased age threshold;
- set candidate_radius threshold, search candidates whose age is
around source victims, searching radius should less than the
radius threshold.
- select target segment with most valid blocks in order to avoid
migrating current target segment.

3. merge valid blocks from source victim into target victim with
SSR alloctor.

Test steps:
- create 160 dirty segments:
 * half of them have 128 valid blocks per segment
 * left of them have 384 valid blocks per segment
- run background GC

Benefit: GC count and block movement count both decrease obviously:

- Before:
  - Valid: 86
  - Dirty: 1
  - Prefree: 11
  - Free: 6001 (6001)

GC calls: 162 (BG: 220)
  - data segments : 160 (160)
  - node segments : 2 (2)
Try to move 41454 blocks (BG: 41454)
  - data blocks : 40960 (40960)
  - node blocks : 494 (494)

IPU: 0 blocks
SSR: 0 blocks in 0 segments
LFS: 41364 blocks in 81 segments

- After:

  - Valid: 87
  - Dirty: 0
  - Prefree: 4
  - Free: 6008 (6008)

GC calls: 75 (BG: 76)
  - data segments : 74 (74)
  - node segments : 1 (1)
Try to move 12813 blocks (BG: 12813)
  - data blocks : 12544 (12544)
  - node blocks : 269 (269)

IPU: 0 blocks
SSR: 12032 blocks in 77 segments
LFS: 855 blocks in 2 segments

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix a bug along with pinfile in-mem segment & clean up]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11 11:11:15 -07:00
Daniel Rosenberg
eca4873ee1 f2fs: Use generic casefolding support
This switches f2fs over to the generic support provided in
the previous patch.

Since casefolded dentries behave the same in ext4 and f2fs, we decrease
the maintenance burden by unifying them, and any optimizations will
immediately apply to both.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10 14:03:31 -07:00
Chao Yu
d0b9e42ab6 f2fs: introduce inmem curseg
Previous implementation of aligned pinfile allocation will:
- allocate new segment on cold data log no matter whether last used
segment is partially used or not, it makes IOs more random;
- force concurrent cold data/GCed IO going into warm data area, it
can make a bad effect on hot/cold data separation;

In this patch, we introduce a new type of log named 'inmem curseg',
the differents from normal curseg is:
- it reuses existed segment type (CURSEG_XXX_NODE/DATA);
- it only exists in memory, its segno, blkofs, summary will not b
 persisted into checkpoint area;

With this new feature, we can enhance scalability of log, special
allocators can be created for purposes:
- pure lfs allocator for aligned pinfile allocation or file
defragmentation
- pure ssr allocator for later feature

So that, let's update aligned pinfile allocation to use this new
inmem curseg fwk.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10 14:03:30 -07:00
Aravind Ramesh
de881df977 f2fs: support zone capacity less than zone size
NVMe Zoned Namespace devices can have zone-capacity less than zone-size.
Zone-capacity indicates the maximum number of sectors that are usable in
a zone beginning from the first sector of the zone. This makes the sectors
sectors after the zone-capacity till zone-size to be unusable.
This patch set tracks zone-size and zone-capacity in zoned devices and
calculate the usable blocks per segment and usable segments per section.

If zone-capacity is less than zone-size mark only those segments which
start before zone-capacity as free segments. All segments at and beyond
zone-capacity are treated as permanently used segments. In cases where
zone-capacity does not align with segment size the last segment will start
before zone-capacity and end beyond the zone-capacity of the zone. For
such spanning segments only sectors within the zone-capacity are used.

During writes and GC manage the usable segments in a section and usable
blocks per segment. Segments which are beyond zone-capacity are never
allocated, and do not need to be garbage collected, only the segments
which are before zone-capacity needs to garbage collected.
For spanning segments based on the number of usable blocks in that
segment, write to blocks only up to zone-capacity.

Zone-capacity is device specific and cannot be configured by the user.
Since NVMe ZNS device zones are sequentially write only, a block device
with conventional zones or any normal block device is needed along with
the ZNS device for the metadata operations of F2fs.

A typical nvme-cli output of a zoned device shows zone start and capacity
and write pointer as below:

SLBA: 0x0     WP: 0x0     Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
SLBA: 0x20000 WP: 0x20000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
SLBA: 0x40000 WP: 0x40000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ

Here zone size is 64MB, capacity is 49MB, WP is at zone start as the zones
are in EMPTY state. For each zone, only zone start + 49MB is usable area,
any lba/sector after 49MB cannot be read or written to, the drive will fail
any attempts to read/write. So, the second zone starts at 64MB and is
usable till 113MB (64 + 49) and the range between 113 and 128MB is
again unusable. The next zone starts at 128MB, and so on.

Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10 14:03:29 -07:00
Linus Torvalds
086ba2ec16 f2fs-for-5.9-rc1
In this round, we've added two small interfaces, 1) GC_URGENT_LOW mode for
 performance, and 2) F2FS_IOC_SEC_TRIM_FILE ioctl for security. The new GC
 mode allows Android to run some lower priority GCs in background, while new
 ioctl discards user information without race condition when the account is
 removed. In addition, some patches were merged to address latency-related
 issues. We've fixed some compression-related bug fixes as well as edge race
 conditions.
 
 Enhancement:
  - add GC_URGENT_LOW mode in gc_urgent
  - introduce F2FS_IOC_SEC_TRIM_FILE ioctl
  - bypass racy readahead to improve read latencies
  - shrink node_write lock coverage to avoid long latency
 
 Bug fix:
  - fix missing compression flag control, i_size, and mount option
  - fix deadlock between quota writes and checkpoint
  - remove inode eviction path in synchronous path to avoid deadlock
  - fix to wait GCed compressed page writeback
  - fix a kernel panic in f2fs_is_compressed_page
  - check page dirty status before writeback
  - wait page writeback before update in node page write flow
  - fix a race condition between f2fs_write_end_io and f2fs_del_fsync_node_entry
 
 We've added some minor sanity checks and refactored trivial code blocks for
 better readability and debugging information.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl8xhqMACgkQQBSofoJI
 UNLYgA//WMoOqBACDuOWwYmgQ8oq4vrH2LOwssZF9/77vEfaHKc+TSq1il54lcUl
 MPEx7FK54CnfT8VLLR5ByobZFyH9FFeAw4FN4LBhcfE8jh8ysAGjeoZjwfcmJF6R
 cVKtn8ltUpgH3IEUuPjTiKkVNHfVJxuuL3zHbg1CEl+AkR6NJ/U9kNLwf7ZgPWq2
 I0qwyXRlUIEChhyPZB+Y6RsdGjkeievKld56DMCgG73f4yHRO/yBcrfsN875sGdM
 ALL+mYiunMT6aXcfoiQiAjeImoNajuflI6Zso2Sk8Vl6sBj0QwAuEBF5x1Z5e1mt
 YVYNuC4ucqsDBKpOqtsPP0MFTC2T5Rr9wWXjqv+9TjN7zvJ8xx+zDWtQxvI2bpqB
 4ZRxaJP45aThLYh/SEYDmj+ppyPtfLDeG0HzUkwMmuopf9eg+kxGPjBsZewgkCKg
 kmMKU0P7deGlkrWLUcz2vm0Lso+ieGm0IeLOQl9/rOLu3IQQFia0Vla7dLDgqF0P
 sz+udIiBztC3JPEmEZhfayA6P6e1TyWQUdquL08jp+DZD17gPqcaZDhHr62U5rmK
 7zoiZqkR03SbNaFhBhhoVOaAVcAnF0pSIgqzkCa3dVXxp1QV+JfD9CGR9NFyiIqB
 HK5RPFskIUCg0K2LSaAKbyoFWa/fJ8ZD8/CbFWcnXfWzoaaSkmc=
 =PjaF
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've added two small interfaces: (a) GC_URGENT_LOW
  mode for performance and (b) F2FS_IOC_SEC_TRIM_FILE ioctl for
  security.

  The new GC mode allows Android to run some lower priority GCs in
  background, while new ioctl discards user information without race
  condition when the account is removed.

  In addition, some patches were merged to address latency-related
  issues. We've fixed some compression-related bug fixes as well as edge
  race conditions.

  Enhancements:
   - add GC_URGENT_LOW mode in gc_urgent
   - introduce F2FS_IOC_SEC_TRIM_FILE ioctl
   - bypass racy readahead to improve read latencies
   - shrink node_write lock coverage to avoid long latency

  Bug fixes:
   - fix missing compression flag control, i_size, and mount option
   - fix deadlock between quota writes and checkpoint
   - remove inode eviction path in synchronous path to avoid deadlock
   - fix to wait GCed compressed page writeback
   - fix a kernel panic in f2fs_is_compressed_page
   - check page dirty status before writeback
   - wait page writeback before update in node page write flow
   - fix a race condition between f2fs_write_end_io and f2fs_del_fsync_node_entry

  We've added some minor sanity checks and refactored trivial code
  blocks for better readability and debugging information"

* tag 'f2fs-for-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (52 commits)
  f2fs: prepare a waiter before entering io_schedule
  f2fs: update_sit_entry: Make the judgment condition of f2fs_bug_on more intuitive
  f2fs: replace test_and_set/clear_bit() with set/clear_bit()
  f2fs: make file immutable even if releasing zero compression block
  f2fs: compress: disable compression mount option if compression is off
  f2fs: compress: add sanity check during compressed cluster read
  f2fs: use macro instead of f2fs verity version
  f2fs: fix deadlock between quota writes and checkpoint
  f2fs: correct comment of f2fs_exist_written_data
  f2fs: compress: delay temp page allocation
  f2fs: compress: fix to update isize when overwriting compressed file
  f2fs: space related cleanup
  f2fs: fix use-after-free issue
  f2fs: Change the type of f2fs_flush_inline_data() to void
  f2fs: add F2FS_IOC_SEC_TRIM_FILE ioctl
  f2fs: should avoid inode eviction in synchronous path
  f2fs: segment.h: delete a duplicated word
  f2fs: compress: fix to avoid memory leak on cc->cpages
  f2fs: use generic names for generic ioctls
  f2fs: don't keep meta inode pages used for compressed block migration
  ...
2020-08-10 18:33:22 -07:00
Chao Yu
1f0b067b6e f2fs: compress: disable compression mount option if compression is off
If CONFIG_F2FS_FS_COMPRESSION is off, don't allow to configure or
show compression related mount option.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-08-03 10:32:52 -07:00
Li Guifu
99c787cfd2 f2fs: fix use-after-free issue
During umount, f2fs_put_super() unregisters procfs entries after
f2fs_destroy_segment_manager(), it may cause use-after-free
issue when umount races with procfs accessing, fix it by relocating
f2fs_unregister_sysfs().

[Chao Yu: change commit title/message a bit]

Signed-off-by: Li Guifu <bluce.liguifu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-07-23 20:22:42 -07:00
Satya Tangirala
27aacd28ea f2fs: add inline encryption support
Wire up f2fs to support inline encryption via the helper functions which
fs/crypto/ now provides.  This includes:

- Adding a mount option 'inlinecrypt' which enables inline encryption
  on encrypted files where it can be used.

- Setting the bio_crypt_ctx on bios that will be submitted to an
  inline-encrypted file.

- Not adding logically discontiguous data to bios that will be submitted
  to an inline-encrypted file.

- Not doing filesystem-layer crypto on inline-encrypted files.

This patch includes a fix for a race during IPU by
Sahitya Tummala <stummala@codeaurora.org>

Signed-off-by: Satya Tangirala <satyat@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20200702015607.1215430-4-satyat@google.com
Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-07-08 10:29:43 -07:00
Jaegeuk Kim
6b12367da2 f2fs: avoid readahead race condition
If two readahead threads having same offset enter in readpages, every read
IOs are split and issued to the disk which giving lower bandwidth.

This patch tries to avoid redundant readahead calls.

Fixes one build error reported by Randy.
Fix build error when F2FS_FS_COMPRESSION is not set/enabled.
This label is needed in either case.

../fs/f2fs/data.c: In function ‘f2fs_mpage_readpages’:
../fs/f2fs/data.c:2327:5: error: label ‘next_page’ used but not defined
     goto next_page;

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-07-07 21:51:48 -07:00
Wang Xiaojun
ba87a45c23 f2fs: use kfree() to free variables allocated by match_strdup()
Use kfree() instead of kvfree() to free variables allocated
by match_strdup(). Because the memory is allocated with kmalloc
inside match_strdup().

Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-06-18 12:37:47 -07:00
Denis Efremov
742532d11d f2fs: use kfree() instead of kvfree() to free superblock data
Use kfree() instead of kvfree() to free super in read_raw_super_block()
because the memory is allocated with kzalloc() in the function.
Use kfree() instead of kvfree() to free sbi, raw_super in
f2fs_fill_super() and f2fs_put_super() because the memory is allocated
with kzalloc().

Signed-off-by: Denis Efremov <efremov@linux.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-06-18 12:33:18 -07:00
Linus Torvalds
42612e7763 f2fs-for-5.8-rc1
In this round, we've added some knobs to enhance compression feature and harden
 testing environment. In addition, we've fixed several bugs reported from Android
 devices such as long discarding latency, device hanging during quota_sync, etc.
 
 Enhancement:
 - support lzo-rle algorithm
 - add two ioctls to release and reserve blocks for compression
 - support partial truncation/fiemap on compressed file
 - introduce sysfs entries to attach IO flags explicitly
 - add iostat trace point along with read io stat
 
 Bug fix:
 - fix long discard latency
 - flush quota data by f2fs_quota_sync correctly
 - fix to recover parent inode number for power-cut recovery
 - fix lz4/zstd output buffer budget
 - parse checkpoint mount option correctly
 - avoid inifinite loop to wait for flushing node/meta pages
 - manage discard space correctly
 
 And some refactoring and clean up patches were added.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl7fojgACgkQQBSofoJI
 UNKaZQ//Rd6r7Z25SJkAoy+y/m6QDaKg4Ap1wR6+QmirR7HNxtpr3dXSVvmj4Xhu
 ZDJ3LHmerFiwR/X4zFPud+PAoBe3gJa2k7GT8q0g4YkgLy0hfX9PXt0t3I9F8vlk
 8m34j+hQaL9/3FBK4/PSG541vR/UUnwvu6t2pJMnz7rgnLej5I6yOIaoaihz7m+i
 k0ofK5ckuTNcZReAZ2tCIehQku7tDOBLdS5KxvBZBgRh0i5iSXXIa4ddvaMJdT/M
 WcjTZ6N8bFu0hCZ5hz9dyGGYo1XchQosLdLGhcEugsyxNp9Yuftyf5/Ie1wJNiEl
 ZsoRc15X7wfRPKKMMyDFljzPBPFiHr78p30uJ34bcYCu0j0CYi+gbKQztmEMZ2dy
 9M+sDG3jd5R7ACXrwS2ElSEDyLBnTaxbeSdCpErGjn/U19TLllbzhnMA9KR9elDI
 pEWgRc7DPmPbRZaStXMxIamf7pbmUSm0akAYbzGFvMHcSx4MXuQFICGK9t/mhSDm
 sO2b1Ir39yk65sVNdjFsnqDsi6jTPgrLSe3FY4eMhkn15OSiVGhcz7ddQMD7Fbuq
 WLpHFqER650I28i0EXh8bxzjkrj+aJQKhGcVbmwVS33MtKVfBdh4GfQMvS6MbeOM
 MsZ10E7Dr9ildKxqHP5SgLlggkl512lpj3+d6j0mUSSSUP2jtUw=
 =MiEC
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've added some knobs to enhance compression feature
  and harden testing environment. In addition, we've fixed several bugs
  reported from Android devices such as long discarding latency, device
  hanging during quota_sync, etc.

  Enhancements:
   - support lzo-rle algorithm
   - add two ioctls to release and reserve blocks for compression
   - support partial truncation/fiemap on compressed file
   - introduce sysfs entries to attach IO flags explicitly
   - add iostat trace point along with read io stat

  Bug fixes:
   - fix long discard latency
   - flush quota data by f2fs_quota_sync correctly
   - fix to recover parent inode number for power-cut recovery
   - fix lz4/zstd output buffer budget
   - parse checkpoint mount option correctly
   - avoid inifinite loop to wait for flushing node/meta pages
   - manage discard space correctly

  And some refactoring and clean up patches were added"

* tag 'f2fs-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (51 commits)
  f2fs: attach IO flags to the missing cases
  f2fs: add node_io_flag for bio flags likewise data_io_flag
  f2fs: remove unused parameter of f2fs_put_rpages_mapping()
  f2fs: handle readonly filesystem in f2fs_ioc_shutdown()
  f2fs: avoid utf8_strncasecmp() with unstable name
  f2fs: don't return vmalloc() memory from f2fs_kmalloc()
  f2fs: fix retry logic in f2fs_write_cache_pages()
  f2fs: fix wrong discard space
  f2fs: compress: don't compress any datas after cp stop
  f2fs: remove unneeded return value of __insert_discard_tree()
  f2fs: fix wrong value of tracepoint parameter
  f2fs: protect new segment allocation in expand_inode_data
  f2fs: code cleanup by removing ifdef macro surrounding
  f2fs: avoid inifinite loop to wait for flushing node pages at cp_error
  f2fs: flush dirty meta pages when flushing them
  f2fs: fix checkpoint=disable:%u%%
  f2fs: compress: fix zstd data corruption
  f2fs: add compressed/gc data read IO stat
  f2fs: fix potential use-after-free issue
  f2fs: compress: don't handle non-compressed data in workqueue
  ...
2020-06-09 11:28:59 -07:00
Eric Biggers
0b6d4ca04a f2fs: don't return vmalloc() memory from f2fs_kmalloc()
kmalloc() returns kmalloc'ed memory, and kvmalloc() returns either
kmalloc'ed or vmalloc'ed memory.  But the f2fs wrappers, f2fs_kmalloc()
and f2fs_kvmalloc(), both return both kinds of memory.

It's redundant to have two functions that do the same thing, and also
breaking the standard naming convention is causing bugs since people
assume it's safe to kfree() memory allocated by f2fs_kmalloc().  See
e.g. the various allocations in fs/f2fs/compress.c.

Fix this by making f2fs_kmalloc() just use kmalloc().  And to avoid
re-introducing the allocation failures that the vmalloc fallback was
intended to fix, convert the largest allocations to use f2fs_kvmalloc().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-06-08 20:34:58 -07:00
Eric Biggers
ed318a6cc0 fscrypt: support test_dummy_encryption=v2
v1 encryption policies are deprecated in favor of v2, and some new
features (e.g. encryption+casefolding) are only being added for v2.

Therefore, the "test_dummy_encryption" mount option (which is used for
encryption I/O testing with xfstests) needs to support v2 policies.

To do this, extend its syntax to be "test_dummy_encryption=v1" or
"test_dummy_encryption=v2".  The existing "test_dummy_encryption" (no
argument) also continues to be accepted, to specify the default setting
-- currently v1, but the next patch changes it to v2.

To cleanly support both v1 and v2 while also making it easy to support
specifying other encryption settings in the future (say, accepting
"$contents_mode:$filenames_mode:v2"), make ext4 and f2fs maintain a
pointer to the dummy fscrypt_context rather than using mount flags.

To avoid concurrency issues, don't allow test_dummy_encryption to be set
or changed during a remount.  (The former restriction is new, but
xfstests doesn't run into it, so no one should notice.)

Tested with 'gce-xfstests -c {ext4,f2fs}/encrypt -g auto'.  On ext4,
there are two regressions, both of which are test bugs: ext4/023 and
ext4/028 fail because they set an xattr and expect it to be stored
inline, but the increase in size of the fscrypt_context from
24 to 40 bytes causes this xattr to be spilled into an external block.

Link: https://lore.kernel.org/r/20200512233251.118314-4-ebiggers@kernel.org
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-18 20:21:48 -07:00
Jaegeuk Kim
1ae18f71cb f2fs: fix checkpoint=disable:%u%%
When parsing the mount option, we don't have sbi->user_block_count.
Should do it after getting it.

Cc: <stable@vger.kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-18 10:47:24 -07:00
Jaegeuk Kim
b4b10061ef f2fs: refactor resize_fs to avoid meta updates in progress
Sahitya raised an issue:
- prevent meta updates while checkpoint is in progress

allocate_segment_for_resize() can cause metapage updates if
it requires to change the current node/data segments for resizing.
Stop these meta updates when there is a checkpoint already
in progress to prevent inconsistent CP data.

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-11 20:37:13 -07:00
Konstantin Khlebnikov
baaa7ebf25 f2fs: report delalloc reserve as non-free in statfs for project quota
This reserved space isn't committed yet but cannot be used for
allocations. For userspace it has no difference from used space.

See the same fix in ext4 commit f06925c73942 ("ext4: report delalloc
reserve as non-free in statfs for project quota").

Fixes: ddc34e328d06 ("f2fs: introduce f2fs_statfs_project")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-11 20:36:47 -07:00
Chao Yu
6d92b20103 f2fs: compress: support lzo-rle compress algorithm
LZO-RLE extension (run length encoding) was introduced to improve
performance of LZO algorithm in scenario of data contains many zeros,
zram has changed to use this extended algorithm by default, this
patch adds to support this algorithm extension, to enable this
extension, it needs to enable F2FS_FS_LZO and F2FS_FS_LZORLE config,
and specifies "compress_algorithm=lzo-rle" mountoption.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-11 20:36:46 -07:00
Chao Yu
5e6bbde959 f2fs: introduce mempool for {,de}compress intermediate page allocation
If compression feature is on, in scenario of no enough free memory,
page refault ratio is higher than before, the root cause is:
- {,de}compression flow needs to allocate intermediate pages to store
compressed data in cluster, so during their allocation, vm may reclaim
mmaped pages.
- if above reclaimed pages belong to compressed cluster, during its
refault, it may cause more intermediate pages allocation, result in
reclaiming more mmaped pages.

So this patch introduces a mempool for intermediate page allocation,
in order to avoid high refault ratio, by default, number of
preallocated page in pool is 512, user can change the number by
assigning 'num_compress_pages' parameter during module initialization.

Ma Feng found warnings in the original patch and fixed like below.

Fix the following sparse warning:
fs/f2fs/compress.c:501:5: warning: symbol 'num_compress_pages' was not declared.
 Should it be static?
fs/f2fs/compress.c:530:6: warning: symbol 'f2fs_compress_free_page' was not
declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Ma Feng <mafeng.ma@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-11 20:35:51 -07:00