IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
[ Upstream commit d80afefb17e01aa0c46a8eebc01882e0ebd8b0f6 ]
f2fs_inode_info.cp_task was introduced for FS_CP_DATA_IO accounting
since commit b0af6d491a6b ("f2fs: add app/fs io stat").
However, cp_task usage coverage has been increased due to below
commits:
commit 040d2bb318d1 ("f2fs: fix to avoid deadloop if data_flush is on")
commit 186857c5a14a ("f2fs: fix potential recursive call when enabling data_flush")
So that, if data_flush mountoption is on, when data flush was
triggered from background, the IO from data flush will be accounted
as checkpoint IO type incorrectly.
In order to fix this issue, this patch splits cp_task into two:
a) cp_task: used for IO accounting
b) wb_task: used to avoid deadlock
Fixes: 040d2bb318d1 ("f2fs: fix to avoid deadloop if data_flush is on")
Fixes: 186857c5a14a ("f2fs: fix potential recursive call when enabling data_flush")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 677a82b44ebf263d4f9a0cfbd576a6ade797a07b upstream.
Yanming reported a kernel bug in Bugzilla kernel [1], which can be
reproduced. The bug message is:
The kernel message is shown below:
kernel BUG at fs/inode.c:611!
Call Trace:
evict+0x282/0x4e0
__dentry_kill+0x2b2/0x4d0
dput+0x2dd/0x720
do_renameat2+0x596/0x970
__x64_sys_rename+0x78/0x90
do_syscall_64+0x3b/0x90
[1] https://bugzilla.kernel.org/show_bug.cgi?id=215895
The bug is due to fuzzed inode has both inline_data and encrypted flags.
During f2fs_evict_inode(), as the inode was deleted by rename(), it
will cause inline data conversion due to conflicting flags. The page
cache will be polluted and the panic will be triggered in clear_inode().
Try fixing the bug by doing more sanity checks for inline data inode in
sanity_check_inode().
Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b5639bb4313b9d455fc9fc4768d23a5e4ca8cb9d upstream.
Tryng to rename a directory that has all following properties fails with
EINVAL and triggers the 'WARN_ON_ONCE(!fscrypt_has_encryption_key(dir))'
in f2fs_match_ci_name():
- The directory is casefolded
- The directory is encrypted
- The directory's encryption key is not yet set up
- The parent directory is *not* encrypted
The problem is incorrect handling of the lookup of ".." to get the
parent reference to update. fscrypt_setup_filename() treats ".." (and
".") specially, as it's never encrypted. It's passed through as-is, and
setting up the directory's key is not attempted. As the name isn't a
no-key name, f2fs treats it as a "normal" name and attempts a casefolded
comparison. That breaks the assumption of the WARN_ON_ONCE() in
f2fs_match_ci_name() which assumes that for encrypted directories,
casefolded comparisons only happen when the directory's key is set up.
We could just remove this WARN_ON_ONCE(). However, since casefolding is
always a no-op on "." and ".." anyway, let's instead just not casefold
these names. This results in the standard bytewise comparison.
Fixes: 7ad08a58bf67 ("f2fs: Handle casefolding with Encryption")
Cc: <stable@vger.kernel.org> # v5.11+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6b8beca0edd32075a769bfe4178ca00c0dcd22a9 upstream.
As Yanming reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=215916
The kernel message is shown below:
kernel BUG at fs/f2fs/segment.c:2560!
Call Trace:
allocate_segment_by_default+0x228/0x440
f2fs_allocate_data_block+0x13d1/0x31f0
do_write_page+0x18d/0x710
f2fs_outplace_write_data+0x151/0x250
f2fs_do_write_data_page+0xef9/0x1980
move_data_page+0x6af/0xbc0
do_garbage_collect+0x312f/0x46f0
f2fs_gc+0x6b0/0x3bc0
f2fs_balance_fs+0x921/0x2260
f2fs_write_single_data_page+0x16be/0x2370
f2fs_write_cache_pages+0x428/0xd00
f2fs_write_data_pages+0x96e/0xd50
do_writepages+0x168/0x550
__writeback_single_inode+0x9f/0x870
writeback_sb_inodes+0x47d/0xb20
__writeback_inodes_wb+0xb2/0x200
wb_writeback+0x4bd/0x660
wb_workfn+0x5f3/0xab0
process_one_work+0x79f/0x13e0
worker_thread+0x89/0xf60
kthread+0x26a/0x300
ret_from_fork+0x22/0x30
RIP: 0010:new_curseg+0xe8d/0x15f0
The root cause is: ckpt.valid_block_count is inconsistent with SIT table,
stat info indicates filesystem has free blocks, but SIT table indicates
filesystem has no free segment.
So that during garbage colloection, it triggers panic when LFS allocator
fails to find free segment.
This patch tries to fix this issue by checking consistency in between
ckpt.valid_block_count and block accounted from SIT.
Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4d17e6fe9293d57081ffdc11e1cf313e25e8fd9e upstream.
As Yanming reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=215897
I have encountered a bug in F2FS file system in kernel v5.17.
The kernel should enable CONFIG_KASAN=y and CONFIG_KASAN_INLINE=y. You can
reproduce the bug by running the following commands:
The kernel message is shown below:
kernel BUG at fs/f2fs/f2fs.h:2511!
Call Trace:
f2fs_remove_inode_page+0x2a2/0x830
f2fs_evict_inode+0x9b7/0x1510
evict+0x282/0x4e0
do_unlinkat+0x33a/0x540
__x64_sys_unlinkat+0x8e/0xd0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
The root cause is: .total_valid_block_count or .total_valid_node_count
could fuzzed to zero, then once dec_valid_node_count() was called, it
will cause BUG_ON(), this patch fixes to print warning info and set
SBI_NEED_FSCK into CP instead of panic.
Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 10a26878564f27327b12e8f4b4d8d7b43065fae5 ]
This patch adds a new function f2fs_dquot_initialize() to wrap
dquot_initialize(), and it supports to inject fault into
f2fs_dquot_initialize() to simulate inner failure occurs in
dquot_initialize().
Usage:
a) echo 65536 > /sys/fs/f2fs/<dev>/inject_type or
b) mount -o fault_type=65536 <dev> <mountpoint>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 02d58cd253d7536c412993573fc6b3b4454960eb ]
Compresse file and normal file has differ in i_addr addressing,
specifically addrs per inode/block. So, we will face data loss, if we
disable the compression flag on non-empty files. Therefore we should
disallow not only enabling but disabling the compression flag on
non-empty files.
Fixes: 4c8ff7095bef ("f2fs: support data compression")
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Hyeong-Jun Kim <hj514.kim@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
In this cycle, we've addressed some performance issues such as lock contention,
misbehaving compress_cache, allowing extent_cache for compressed files, and new
sysfs to adjust ra_size for fadvise. In order to diagnose the performance issues
quickly, we also added an iostat which shows the IO latencies periodically. On
the stability side, we've found two memory leakage cases in the error path in
compression flow. And, we've also fixed various corner cases in fiemap, quota,
checkpoint=disable, zstd, and so on.
Enhancement:
- avoid long checkpoint latency by releasing nat_tree_lock
- collect and show iostats periodically
- support extent_cache for compressed files
- add a sysfs entry to manage ra_size given fadvise(POSIX_FADV_SEQUENTIAL)
- report f2fs GC status via sysfs
- add discard_unit=%s in mount option to handle zoned device
Bug fix:
- fix two memory leakages when an error happens in the compressed IO flow
- fix commpress_cache to get the right LBA
- fix fiemap to deal with compressed case correctly
- fix wrong EIO returns due to SBI_NEED_FSCK
- fix missing writes when enabling checkpoint back
- fix quota deadlock
- fix zstd level mount option
In addition to the above major updates, we've cleaned up several code paths such
as dio, unnecessary operations, debugfs/f2fs/status, sanity check, and typos.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmEyw1sACgkQQBSofoJI
UNLJmA/+NHUgwUjLMcHvmLyp6QYpQDZtKj93/sRDo+YHOYNdYFjWWUb329PYTKWS
kEdzApCP+KHfVxeSkiL/x3qWP+RlTkIf96P0kR3/BKi0tjg25G2riFWztusDDFpt
xi+AW5sUFDvIx1tFumvQHAQedSwBgcZ96ovT5EwxEuONkljhZC9phEC6vSXz9nOR
e2EQIyezbC5O21np1KSeqSgqRMpVkJkVcEHy4VmpMBCLMOOYPepWwKw+yPaV/jR/
zUXdo2/53vma50M5LCDPCtjCtWQgLoeNeGLxyjfzQuTJU6TmtPY65JObLPt6pUSj
fRW6qIziTZbVYXzOWBD0EYilv2N4c3BNJdhQCpx2Vyjw9/LLxzqKPOUyzBoa1kjY
eZVvmaLXVCKsoJdHDSi7OH/4BqS6SuSZE8eO/nGkgswqiErHZ0Vwl3bFCWC7r/Bk
r2U5spJx/83XO6c9H1bzeWEies1DRtwnCDIRRuw35RtJ4uHZaqCfkuJ7rOBwC90X
4SpaAKdUxP2RWc3GKELBIhaqPn7vyMy9ile6VU14PjM8UcY5hyE87T2azqR8gGut
nVjRL4cbMGTPj6m1Qj8KqBRSaLuShe6AncUy7bNGiM+JlcLcdB6OJ1ZYLl9hjx2r
TbIouXThgcZ4SIK0DEaBLKz2b9/0TfaO9gw1XzpRma+bWA1pApM=
=W67o
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this cycle, we've addressed some performance issues such as lock
contention, misbehaving compress_cache, allowing extent_cache for
compressed files, and new sysfs to adjust ra_size for fadvise.
In order to diagnose the performance issues quickly, we also added an
iostat which shows the IO latencies periodically.
On the stability side, we've found two memory leakage cases in the
error path in compression flow. And, we've also fixed various corner
cases in fiemap, quota, checkpoint=disable, zstd, and so on.
Enhancements:
- avoid long checkpoint latency by releasing nat_tree_lock
- collect and show iostats periodically
- support extent_cache for compressed files
- add a sysfs entry to manage ra_size given fadvise(POSIX_FADV_SEQUENTIAL)
- report f2fs GC status via sysfs
- add discard_unit=%s in mount option to handle zoned device
Bug fixes:
- fix two memory leakages when an error happens in the compressed IO flow
- fix commpress_cache to get the right LBA
- fix fiemap to deal with compressed case correctly
- fix wrong EIO returns due to SBI_NEED_FSCK
- fix missing writes when enabling checkpoint back
- fix quota deadlock
- fix zstd level mount option
In addition to the above major updates, we've cleaned up several code
paths such as dio, unnecessary operations, debugfs/f2fs/status, sanity
check, and typos"
* tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (46 commits)
f2fs: should put a page beyond EOF when preparing a write
f2fs: deallocate compressed pages when error happens
f2fs: enable realtime discard iff device supports discard
f2fs: guarantee to write dirty data when enabling checkpoint back
f2fs: fix to unmap pages from userspace process in punch_hole()
f2fs: fix unexpected ENOENT comes from f2fs_map_blocks()
f2fs: fix to account missing .skipped_gc_rwsem
f2fs: adjust unlock order for cleanup
f2fs: Don't create discard thread when device doesn't support realtime discard
f2fs: rebuild nat_bits during umount
f2fs: introduce periodic iostat io latency traces
f2fs: separate out iostat feature
f2fs: compress: do sanity check on cluster
f2fs: fix description about main_blkaddr node
f2fs: convert S_IRUGO to 0444
f2fs: fix to keep compatibility of fault injection interface
f2fs: support fault injection for f2fs_kmem_cache_alloc()
f2fs: compress: allow write compress released file after truncate to zero
f2fs: correct comment in segment.h
f2fs: improve sbi status info in debugfs/f2fs/status
...
Don't create discard thread when device doesn't support realtime discard
or user specifies nodiscard mount option.
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If all free_nat_bitmap are available, we can rebuild nat_bits from
free_nat_bitmap entirely during umount, let's make another chance
to reenable nat_bits for image.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Whenever we notice some sluggish issues on our machines, we are always
curious about how well all types of I/O in the f2fs filesystem are
handled. But, it's hard to get this kind of real data. First of all,
we need to reproduce the issue while turning on the profiling tool like
blktrace, but the issue doesn't happen again easily. Second, with the
intervention of any tools, the overall timing of the issue will be
slightly changed and it sometimes makes us hard to figure it out.
So, I added the feature printing out IO latency statistics tracepoint
events, which are minimal things to understand filesystem's I/O related
behaviors, into F2FS_IOSTAT kernel config. With "iostat_enable" sysfs
node on, we can get this statistics info in a periodic way and it
would cause the least overhead.
[samples]
f2fs_ckpt-254:1-507 [003] .... 2842.439683: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [136/1/801], rd_node [136/1/1704], rd_meta [4/2/4],
wr_sync_data [164/16/3331], wr_sync_node [152/3/648],
wr_sync_meta [160/2/4243], wr_async_data [24/13/15],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
f2fs_ckpt-254:1-507 [002] .... 2845.450514: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [60/3/456], rd_node [60/3/1258], rd_meta [0/0/1],
wr_sync_data [120/12/2285], wr_sync_node [88/5/428],
wr_sync_meta [52/6/2990], wr_async_data [4/1/3],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Added F2FS_IOSTAT config option to support getting IO statistics through
sysfs and printing out periodic IO statistics tracepoint events and
moved I/O statistics related codes into separate files for better
maintenance.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
[Jaegeuk Kim: set default=y]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds f2fs_sanity_check_cluster() to support doing
sanity check on cluster of compressed file, it will be triggered
from below two paths:
- __f2fs_cluster_blocks()
- f2fs_map_blocks(F2FS_GET_BLOCK_FIEMAP)
And it can detect below three kind of cluster insanity status.
C: COMPRESS_ADDR
N: NULL_ADDR or NEW_ADDR
V: valid blkaddr
*: any value
1. [*|C|*|*]
2. [C|*|C|*]
3. [C|N|N|V]
Signed-off-by: Chao Yu <chao@kernel.org>
[Nathan Chancellor: fix missing inline warning]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The value of FAULT_* macros and its description in f2fs.rst became
inconsistent, fix this to keep compatibility of fault injection
interface.
Fixes: 67883ade7a98 ("f2fs: remove FAULT_ALLOC_BIO")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch supports to inject fault into f2fs_kmem_cache_alloc().
Usage:
a) echo 32768 > /sys/fs/f2fs/<dev>/inject_type or
b) mount -o fault_type=32768 <dev> <mountpoint>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Compressed inode may suffer read performance issue due to it can not
use extent cache, so I propose to add this unaligned extent support
to improve it.
Currently, it only works in readonly format f2fs image.
Unaligned extent: in one compressed cluster, physical block number
will be less than logical block number, so we add an extra physical
block length in extent info in order to indicate such extent status.
The idea is if one whole cluster blocks are contiguous physically,
once its mapping info was readed at first time, we will cache an
unaligned (or aligned) extent info entry in extent cache, it expects
that the mapping info will be hitted when rereading cluster.
Merge policy:
- Aligned extents can be merged.
- Aligned extent and unaligned extent can not be merged.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
During f2fs_write_checkpoint(), once we failed in
f2fs_flush_nat_entries() or do_checkpoint(), metadata of filesystem
such as prefree bitmap, nat/sit version bitmap won't be recovered,
it may cause f2fs image to be inconsistent, let's just set CP error
flag to avoid further updates until we figure out a scheme to rollback
all metadatas in such condition.
Reported-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
fadvise() allows the user to expand the readahead window to double with
POSIX_FADV_SEQUENTIAL, now. But, in some use cases, it is not that
sufficient and we need to meet the need in a restricted way. We can
control the multiplier value of bdi device readahead between 2 (default)
and 256 for POSIX_FADV_SEQUENTIAL advise option.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As James Z reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=213877
[1.] One-line summary of the problem:
Mount multiple SMR block devices exceed certain number cause system non-response
[2.] Full description of the problem/report:
Created some F2FS on SMR devices (mkfs.f2fs -m), then mounted in sequence. Each device is the same Model: HGST HSH721414AL (Size 14TB).
Empirically, found that when the amount of SMR device * 1.5Gb > System RAM, the system ran out of memory and hung. No dmesg output. For example, 24 SMR Disk need 24*1.5GB = 36GB. A system with 32G RAM can only mount 21 devices, the 22nd device will be a reproducible cause of system hang.
The number of SMR devices with other FS mounted on this system does not interfere with the result above.
[3.] Keywords (i.e., modules, networking, kernel):
F2FS, SMR, Memory
[4.] Kernel information
[4.1.] Kernel version (uname -a):
Linux 5.13.4-200.fc34.x86_64 #1 SMP Tue Jul 20 20:27:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[4.2.] Kernel .config file:
Default Fedora 34 with f2fs-tools-1.14.0-2.fc34.x86_64
[5.] Most recent kernel version which did not have the bug:
None
[6.] Output of Oops.. message (if applicable) with symbolic information
resolved (see Documentation/admin-guide/oops-tracing.rst)
None
[7.] A small shell script or example program which triggers the
problem (if possible)
mount /dev/sdX /mnt/0X
[8.] Memory consumption
With 24 * 14T SMR Block device with F2FS
free -g
total used free shared buff/cache available
Mem: 46 36 0 0 10 10
Swap: 0 0 0
With 3 * 14T SMR Block device with F2FS
free -g
total used free shared buff/cache available
Mem: 7 5 0 0 1 1
Swap: 7 0 7
The root cause is, there are three bitmaps:
- cur_valid_map
- ckpt_valid_map
- discard_map
and each of them will cost ~500MB memory, {cur, ckpt}_valid_map are
necessary, but discard_map is optional, since this bitmap will only be
useful in mountpoint that small discard is enabled.
For a blkzoned device such as SMR or ZNS devices, f2fs will only issue
discard for a section(zone) when all blocks of that section are invalid,
so, for such device, we don't need small discard functionality at all.
This patch introduces a new mountoption "discard_unit=block|segment|
section" to support issuing discard with different basic unit which is
aligned to block, segment or section, so that user can specify
"discard_unit=segment" or "discard_unit=section" to disable small
discard functionality.
Note that this mount option can not be changed by remount() due to
related metadata need to be initialized during mount().
In order to save memory, let's use "discard_unit=section" for blkzoned
device by default.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We can just check f2fs_lfs_mode() directly. The block_unaligned_IO()
check is redundant because in LFS mode, f2fs doesn't do direct I/O
writes that aren't block-aligned (due to f2fs_force_buffered_io()
returning true in this case, triggering the fallback to buffered I/O).
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As 5kft reported in [1]:
set_compress_context() should set compress level into .i_compress_flag
for zstd as well as lz4hc, otherwise, zstd compressor will still use
default zstd compress level during compression, fix it.
[1] https://lore.kernel.org/linux-f2fs-devel/8e29f52b-6b0d-45ec-9520-e63eb254287a@www.fastmail.com/T/#u
Fixes: 3fde13f817e2 ("f2fs: compress: support compress level")
Reported-by: 5kft <5kft@5kft.org>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Added gc_reclaimed_segments and gc_segment_mode sysfs nodes.
1) "gc_reclaimed_segments" shows how many segments have been
reclaimed by GC during a specific GC mode.
2) "gc_segment_mode" is used to control for which gc mode
the "gc_reclaimed_segments" node shows.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use invalidate_lock instead of f2fs' private i_mmap_sem. The intended
purpose is exactly the same. By this conversion we fix a long standing
race between hole punching and read(2) / readahead(2) paths that can
lead to stale page cache contents.
CC: Jaegeuk Kim <jaegeuk@kernel.org>
CC: Chao Yu <yuchao0@huawei.com>
CC: linux-f2fs-devel@lists.sourceforge.net
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When we create a directory with enable compression, all file write into
directory will try to compress.But sometimes we may know, new file
cannot meet compression ratio requirements.
We need a nocompress extension to skip those files to avoid unnecessary
compress page test.
After add nocompress_extension, the priority should be:
dir_flag < comp_extention,nocompress_extension < comp_file_flag,
no_comp_file_flag.
Priority in between FS_COMPR_FL, FS_NOCOMP_FS, extensions:
* compress_extension=so; nocompress_extension=zip; chattr +c dir;
touch dir/foo.so; touch dir/bar.zip; touch dir/baz.txt; then foo.so
and baz.txt should be compresse, bar.zip should be non-compressed.
chattr +c dir/bar.zip can enable compress on bar.zip.
* compress_extension=so; nocompress_extension=zip; chattr -c dir;
touch dir/foo.so; touch dir/bar.zip; touch dir/baz.txt; then foo.so
should be compresse, bar.zip and baz.txt should be non-compressed.
chattr+c dir/bar.zip; chattr+c dir/baz.txt; can enable compress on
bar.zip and baz.txt.
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch supports to migrate swapfile in aligned write mode during
swapon in order to keep swapfile being aligned to section as much as
possible, then pinned swapfile will locates fully filled section which
may not affected by GC.
However, for the case that swapfile's size is not aligned to section
size, it will still leave last extent in file's tail as unaligned due
to its size is smaller than section size, like case #2.
case #1
xfs_io -f /mnt/f2fs/file -c "pwrite 0 4M" -c "fsync"
Before swapon:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..3047]: 1123352..1126399 3048 0x1000
1: [3048..7143]: 237568..241663 4096 0x1000
2: [7144..8191]: 245760..246807 1048 0x1001
After swapon:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..8191]: 249856..258047 8192 0x1001
Kmsg:
F2FS-fs (zram0): Swapfile (2) is not align to section:
1) creat(), 2) ioctl(F2FS_IOC_SET_PIN_FILE), 3) fallocate(2097152 * n)
case #2
xfs_io -f /mnt/f2fs/file -c "pwrite 0 3M" -c "fsync"
Before swapon:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..3047]: 246808..249855 3048 0x1000
1: [3048..6143]: 237568..240663 3096 0x1001
After swapon:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 258048..262143 4096 0x1000
1: [4096..6143]: 238616..240663 2048 0x1001
Kmsg:
F2FS-fs (zram0): Swapfile: last extent is not aligned to section
F2FS-fs (zram0): Swapfile (2) is not align to section:
1) creat(), 2) ioctl(F2FS_IOC_SET_PIN_FILE), 3) fallocate(2097152 * n)
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Support to use address space of inner inode to cache compressed block,
in order to improve cache hit ratio of random read.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Let's create /sys/fs/f2fs/<disk>/feature_list/ to meet sysfs rule.
Note that there are three feature list entries:
1) /sys/fs/f2fs/features
: shows runtime features supported by in-kernel f2fs along with Kconfig.
- ref. F2FS_FEATURE_RO_ATTR()
2) /sys/fs/f2fs/$s_id/features <deprecated>
: shows on-disk features enabled by mkfs.f2fs, used for old kernels. This
won't add new feature anymore, and thus, users should check entries in 3)
instead of this 2).
3) /sys/fs/f2fs/$s_id/feature_list
: shows on-disk features enabled by mkfs.f2fs per instance, which follows
sysfs entry rule where each entry should expose single value.
This list covers old feature list provided by 2) and beyond. Therefore,
please add new on-disk feature in this list only.
- ref. F2FS_SB_FEATURE_RO_ATTR()
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Given RO feature in superblock, we don't need to check provisioning/reserve
spaces and SSA area.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Once we release compressed blocks, we used to set IMMUTABLE bit. But it turned
out it disallows every fs operations which we don't need for compression.
Let's just prevent writing data only.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As syzbot reported, there is an use-after-free issue during f2fs recovery:
Use-after-free write at 0xffff88823bc16040 (in kfence-#10):
kmem_cache_destroy+0x1f/0x120 mm/slab_common.c:486
f2fs_recover_fsync_data+0x75b0/0x8380 fs/f2fs/recovery.c:869
f2fs_fill_super+0x9393/0xa420 fs/f2fs/super.c:3945
mount_bdev+0x26c/0x3a0 fs/super.c:1367
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x86/0x270 fs/super.c:1497
do_new_mount fs/namespace.c:2905 [inline]
path_mount+0x196f/0x2be0 fs/namespace.c:3235
do_mount fs/namespace.c:3248 [inline]
__do_sys_mount fs/namespace.c:3456 [inline]
__se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3433
do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
The root cause is multi f2fs filesystem instances can race on accessing
global fsync_entry_slab pointer, result in use-after-free issue of slab
cache, fixes to init/destroy this slab cache only once during module
init/destroy procedure to avoid this issue.
Reported-by: syzbot+9d90dad32dd9727ed084@syzkaller.appspotmail.com
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Restruct f2fs page private layout for below reasons:
There are some cases that f2fs wants to set a flag in a page to
indicate a specified status of page:
a) page is in transaction list for atomic write
b) page contains dummy data for aligned write
c) page is migrating for GC
d) page contains inline data for inline inode flush
e) page belongs to merkle tree, and is verified for fsverity
f) page is dirty and has filesystem/inode reference count for writeback
g) page is temporary and has decompress io context reference for compression
There are existed places in page structure we can use to store
f2fs private status/data:
- page.flags: PG_checked, PG_private
- page.private
However it was a mess when we using them, which may cause potential
confliction:
page.private PG_private PG_checked page._refcount (+1 at most)
a) -1 set +1
b) -2 set
c), d), e) set
f) 0 set +1
g) pointer set
The other problem is page.flags has no free slot, if we can avoid set
zero to page.private and set PG_private flag, then we use non-zero value
to indicate PG_private status, so that we may have chance to reclaim
PG_private slot for other usage. [1]
The other concern is f2fs has bad scalability in aspect of indicating
more page status.
So in this patch, let's restructure f2fs' page.private as below to
solve above issues:
Layout A: lowest bit should be 1
| bit0 = 1 | bit1 | bit2 | ... | bit MAX | private data .... |
bit 0 PAGE_PRIVATE_NOT_POINTER
bit 1 PAGE_PRIVATE_ATOMIC_WRITE
bit 2 PAGE_PRIVATE_DUMMY_WRITE
bit 3 PAGE_PRIVATE_ONGOING_MIGRATION
bit 4 PAGE_PRIVATE_INLINE_INODE
bit 5 PAGE_PRIVATE_REF_RESOURCE
bit 6- f2fs private data
Layout B: lowest bit should be 0
page.private is a wrapped pointer.
After the change:
page.private PG_private PG_checked page._refcount (+1 at most)
a) 11 set +1
b) 101 set +1
c) 1001 set +1
d) 10001 set +1
e) set
f) 100001 set +1
g) pointer set +1
[1] https://lore.kernel.org/linux-f2fs-devel/20210422154705.GO3596236@casper.infradead.org/T/#u
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_destroy_compress_ctx(), after f2fs_destroy_compress_ctx(),
cc.cluster_idx will be cleared w/ NULL_CLUSTER, f2fs_cluster_blocks()
may check wrong cluster metadata, fix it.
Fixes: 4c8ff7095bef ("f2fs: support data compression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this round, we added a new mount option, "checkpoint_merge", which introduces
a kernel thread dealing with the f2fs checkpoints. Once we start to manage the
IO priority along with blk-cgroup, the checkpoint operation can be processed in
a lower priority under the process context. Since the checkpoint holds all the
filesystem operations, we give a higher priority to the checkpoint thread all
the time.
Enhancement:
- introduce gc_merge mount option to introduce a checkpoint thread
- improve to run discard thread efficiently
- allow modular compression algorithms
- expose # of overprivision segments to sysfs
- expose runtime compression stat to sysfs
Bug fix:
- fix OOB memory access by the node id lookup
- avoid touching checkpointed data in the checkpoint-disabled mode
- fix the resizing flow to avoid kernel panic and race conditions
- fix block allocation issues on pinned files
- address some swapfile issues
- fix hugtask problem and kernel panic during atomic write operations
- don't start checkpoint thread in RO
And, we've cleaned up some kernel coding style and build warnings. In addition,
we fixed some minor race conditions and error handling routines.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmCQhVIACgkQQBSofoJI
UNIggA/8DZINzFLMCj6+6P5wNAWj3nYtx/FnwZ7C31f8qkiZjgA4LfONUnDvV7sU
GS8MuLQz4eTYfqU2rVgGiSm+aCkEOovnk7C7Huo7pezgqYb+5J6ACXsqdU3dcD5M
kShJMqLKcTKqtOMbnJrdGvmw1/ysuAi7UhSSgVV+9NQxlhxADnagOGbQ7lXNSV3R
spGMWazGY2uA5DFCCa4lMX79lyFATCzEKB3SKAW5r+8QSmxJY8ViK2Er7AnvwRJz
XJ/QJ8ALNb/GGyHzBWFv3P6Yxo/G3FkUvTIc5Rhi9P2lUgjI2NALokj7AOnfNh4a
uSXHVlNrrfH+gpx9xr5z8MUmroCYCCOJ6EhnVweqViRmekY8jSb2HxmUtDTIf19U
LWl3gtD2GDQx6CY0a0K58Oa2Lp0Bp9MWUdPA/4P21EymZwXum7aCkhV+DnigcoCj
yCmKlI8nIpCS97dIO/7MsnG6Tu/7c+Prytd2ezUo+6hlkXPZk8shs+elnjlWu/6V
3ZpSWKzQsPJL8U7eB9H04AxEokrrXm3fRhR86C7JdkEe0gyGFf3dB/G+jqzcNi5m
ZpxiCeZ8RbNmPpnH8NWeYHk9uDKMOXuUPFYDoaOwImNWfqj0jfhiTxHo4MyBLAuk
MT632ICcuJLvwgnSbMAI3U7v6+dZXKH4y6U7IHFjxhI7beMzxmI=
=P5uk
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we added a new mount option, "checkpoint_merge", which
introduces a kernel thread dealing with the f2fs checkpoints. Once we
start to manage the IO priority along with blk-cgroup, the checkpoint
operation can be processed in a lower priority under the process
context. Since the checkpoint holds all the filesystem operations, we
give a higher priority to the checkpoint thread all the time.
Enhancements:
- introduce gc_merge mount option to introduce a checkpoint thread
- improve to run discard thread efficiently
- allow modular compression algorithms
- expose # of overprivision segments to sysfs
- expose runtime compression stat to sysfs
Bug fixes:
- fix OOB memory access by the node id lookup
- avoid touching checkpointed data in the checkpoint-disabled mode
- fix the resizing flow to avoid kernel panic and race conditions
- fix block allocation issues on pinned files
- address some swapfile issues
- fix hugtask problem and kernel panic during atomic write operations
- don't start checkpoint thread in RO
And, we've cleaned up some kernel coding style and build warnings. In
addition, we fixed some minor race conditions and error handling
routines"
* tag 'f2fs-for-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (48 commits)
f2fs: drop inplace IO if fs status is abnormal
f2fs: compress: remove unneed check condition
f2fs: clean up left deprecated IO trace codes
f2fs: avoid using native allocate_segment_by_default()
f2fs: remove unnecessary struct declaration
f2fs: fix to avoid NULL pointer dereference
f2fs: avoid duplicated codes for cleanup
f2fs: document: add description about compressed space handling
f2fs: clean up build warnings
f2fs: fix the periodic wakeups of discard thread
f2fs: fix to avoid accessing invalid fio in f2fs_allocate_data_block()
f2fs: fix to avoid GC/mmap race with f2fs_truncate()
f2fs: set checkpoint_merge by default
f2fs: Fix a hungtask problem in atomic write
f2fs: fix to restrict mount condition on readonly block device
f2fs: introduce gc_merge mount option
f2fs: fix to cover __allocate_new_section() with curseg_lock
f2fs: fix wrong alloc_type in f2fs_do_replace_block
f2fs: delete empty compress.h
f2fs: fix a typo in inode.c
...
Commit d5f7bc0064e0 ("f2fs: deprecate f2fs_trace_io") left some
dead codes, delete them.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As we did for other cases, in fix_curseg_write_pointer(), let's
use wrapped f2fs_allocate_new_section() instead of native
allocate_segment_by_default(), by this way, it fixes to cover
segment allocation with curseg_lock and sentry_lock.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
struct dnode_of_data is defined at 897th line.
The declaration here is unnecessary. Remove it.
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
In this patch, we will add two new mount options: "gc_merge" and
"nogc_merge", when background_gc is on, "gc_merge" option can be
set to let background GC thread to handle foreground GC requests,
it can eliminate the sluggish issue caused by slow foreground GC
operation when GC is triggered from a process with limited I/O
and CPU resources.
Original idea is from Xiang.
Signed-off-by: Gao Xiang <xiang@kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In CP disabling mode, there are two issues when using LFS or SSR | AT_SSR
mode to select victim:
1. LFS is set to find source section during GC, the victim should have
no checkpointed data, since after GC, section could not be set free for
reuse.
Previously, we only check valid chpt blocks in current segment rather
than section, fix it.
2. SSR | AT_SSR are set to find target segment for writes which can be
fully filled by checkpointed and newly written blocks, we should never
select such segment, otherwise it can cause panic or data corruption
during allocation, potential case is described as below:
a) target segment has 'n' (n < 512) ckpt valid blocks
b) GC migrates 'n' valid blocks to other segment (segment is still
in dirty list)
c) GC migrates '512 - n' blocks to target segment (segment has 'n'
cp_vblocks and '512 - n' vblocks)
d) If GC selects target segment via {AT,}SSR allocator, however there
is no free space in targe segment.
Fixes: 4354994f097d ("f2fs: checkpoint disabling")
Fixes: 093749e296e2 ("f2fs: support age threshold based garbage collection")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
I've added new sysfs nodes to show runtime compression stat since mount.
compr_written_block - show the block count written after compression
compr_saved_block - show the saved block count with compression
compr_new_inode - show the count of inode newly enabled for compression
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now, fallocate() on a pinned file only allocates blocks which aligns
to segment rather than section, so GC may try to migrate pinned file's
block, and after several times of failure, pinned file's block could
be migrated to other place, however user won't be aware of such
condition, and then old obsolete block address may be readed/written
incorrectly.
To avoid such condition, let's try to allocate pinned file's blocks
with section alignment.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
- file_clear_encrypt() was never be used, remove it.
- In addition, relocating macros for cleanup.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now swapfile in f2fs directly submit IO to blockdev according to
swapfile extents reported by f2fs when swapon, therefore there is
no need to check IS_SWAPFILE when exec filesystem operation.
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Guo Weichao <guoweichao@oppo.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>