2018-09-12 09:16:07 +08:00
// SPDX-License-Identifier: GPL-2.0
2012-11-29 13:28:09 +09:00
/*
2012-11-02 17:07:47 +09:00
* fs/f2fs/super.c
*
* Copyright (c) 2012 Samsung Electronics Co., Ltd.
*             http://www.samsung.com/
*/
#include <linux/module.h>
#include <linux/init.h>
#include <linux/fs.h>
2022-06-06 15:32:41 -07:00
#include <linux/fs_context.h>
mm: introduce memalloc_retry_wait()
Various places in the kernel - largely in filesystems - respond to a
memory allocation failure by looping around and re-trying. Some of
these cannot conveniently use __GFP_NOFAIL, for reasons such as:
- a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
- a need to check for the process being signalled between failures
- the possibility that other recovery actions could be performed
- the allocation is quite deep in support code, and passing down an
extra flag to say if __GFP_NOFAIL is wanted would be clumsy.
Many of these currently use congestion_wait() which (in almost all
cases) simply waits the given timeout - congestion isn't tracked for
most devices.
It isn't clear what the best delay is for loops, but it is clear that
the various filesystems shouldn't be responsible for choosing a timeout.
This patch introduces memalloc_retry_wait() which takes on that
responsibility. Code that wants to retry a memory allocation can call
this function passing the GFP flags that were used. It will wait
however is appropriate.
For now, it only considers __GFP_NORETRY and whatever
gfpflags_allow_blocking() tests. If blocking is allowed without
__GFP_NORETRY, then alloc_page either made some reclaim progress, or
waited for a while, before failing. So there is no need for much
further waiting. memalloc_retry_wait() will wait until the current
jiffy ends. If this condition is not met, then alloc_page() won't have
waited much if at all. In that case memalloc_retry_wait() waits about
200ms. This is the delay that most current loops use.
linux/sched/mm.h needs to be included in some files now,
but linux/backing-dev.h does not.
Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
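The choice of delay described in this message can be modeled in plain C. This is a minimal userspace sketch, not the kernel implementation: `retry_wait_kind()`, the `SKETCH_*` bits and the `retry_wait` enum are hypothetical names that stand in for the `gfpflags_allow_blocking()` test and the `__GFP_NORETRY` flag.

```c
#include <assert.h>

/* Hypothetical flag bits, used only by this sketch. */
#define SKETCH_ALLOW_BLOCKING	0x1u	/* gfpflags_allow_blocking() would be true */
#define SKETCH_NORETRY		0x2u	/* __GFP_NORETRY would be set */

/* The two delays the commit message distinguishes. */
enum retry_wait {
	RETRY_WAIT_SHORT,	/* wait until the current jiffy ends */
	RETRY_WAIT_LONG,	/* wait about 200ms, per the commit message */
};

enum retry_wait retry_wait_kind(unsigned int flags)
{
	/* Reject bits this sketch does not model. */
	assert((flags & ~(SKETCH_ALLOW_BLOCKING | SKETCH_NORETRY)) == 0);

	/* The allocator already reclaimed or slept before it failed. */
	if ((flags & SKETCH_ALLOW_BLOCKING) && !(flags & SKETCH_NORETRY))
		return RETRY_WAIT_SHORT;

	/* The allocator gave up almost immediately, so back off longer. */
	return RETRY_WAIT_LONG;
}
```

A retry loop would call the real memalloc_retry_wait() with the same GFP flags it passed to the failed allocation; the sketch only shows which of the two delays that call would pick.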
2022-01-14 14:07:14 -08:00
#include <linux/sched/mm.h>
2012-11-02 17:07:47 +09:00
#include <linux/statfs.h>
#include <linux/buffer_head.h>
#include <linux/kthread.h>
#include <linux/parser.h>
#include <linux/mount.h>
#include <linux/seq_file.h>
2013-06-28 12:47:01 +09:00
#include <linux/proc_fs.h>
2012-11-02 17:07:47 +09:00
#include <linux/random.h>
#include <linux/exportfs.h>
2013-03-17 17:26:14 +09:00
#include <linux/blkdev.h>
2017-07-09 00:13:07 +08:00
#include <linux/quotaops.h>
2012-11-02 17:07:47 +09:00
#include <linux/f2fs_fs.h>
2013-08-04 23:09:40 +09:00
#include <linux/sysfs.h>
2017-08-08 10:54:31 +08:00
#include <linux/quota.h>
2019-07-23 16:05:28 -07:00
#include <linux/unicode.h>
2020-03-25 16:48:42 +01:00
#include <linux/part_stat.h>
2021-01-22 17:46:43 +08:00
#include <linux/zstd.h>
#include <linux/lz4.h>
2012-11-02 17:07:47 +09:00
#include "f2fs.h"
#include "node.h"
2013-03-31 13:26:03 +09:00
#include "segment.h"
2012-11-02 17:07:47 +09:00
#include "xattr.h"
2013-08-04 23:09:40 +09:00
#include "gc.h"
2021-08-19 20:52:28 -07:00
#include "iostat.h"
2012-11-02 17:07:47 +09:00
2013-04-20 01:28:40 +09:00
#define CREATE_TRACE_POINTS
#include <trace/events/f2fs.h>
2012-11-02 17:07:47 +09:00
static struct kmem_cache *f2fs_inode_cachep;
2016-04-29 15:34:32 -07:00
#ifdef CONFIG_F2FS_FAULT_INJECTION
2016-04-29 15:49:56 -07:00
2018-11-24 12:06:42 +03:00
const char *f2fs_fault_name[FAULT_MAX] = {
2016-04-29 15:49:56 -07:00
	[FAULT_KMALLOC] = "kmalloc",
2017-11-30 19:28:18 +08:00
	[FAULT_KVMALLOC] = "kvmalloc",
2016-04-29 16:17:09 -07:00
	[FAULT_PAGE_ALLOC] = "page alloc",
2017-10-28 16:52:30 +08:00
	[FAULT_PAGE_GET] = "page get",
2016-04-29 16:29:22 -07:00
	[FAULT_ALLOC_NID] = "alloc nid",
	[FAULT_ORPHAN] = "orphan",
	[FAULT_BLOCK] = "no more block",
	[FAULT_DIR_DEPTH] = "too big dir depth",
2016-05-25 15:24:18 -07:00
	[FAULT_EVICT_INODE] = "evict_inode fail",
2017-03-09 15:24:24 -08:00
	[FAULT_TRUNCATE] = "truncate fail",
2018-09-12 09:22:29 +08:00
	[FAULT_READ_IO] = "read IO error",
2016-09-26 19:45:55 +08:00
	[FAULT_CHECKPOINT] = "checkpoint error",
2018-08-06 20:30:18 +08:00
	[FAULT_DISCARD] = "discard error",
2018-09-12 09:22:29 +08:00
	[FAULT_WRITE_IO] = "write IO error",
2021-08-09 08:24:48 +08:00
	[FAULT_SLAB_ALLOC] = "slab alloc",
2021-10-28 21:03:05 +08:00
	[FAULT_DQUOT_INIT] = "dquot initialize",
2021-12-12 17:17:51 +08:00
	[FAULT_LOCK_OP] = "lock_op",
2022-10-06 23:09:28 +08:00
	[FAULT_BLKADDR] = "invalid blkaddr",
2016-04-29 15:49:56 -07:00
};
2016-05-16 12:38:50 +08:00
2018-08-08 17:36:41 +08:00
void f2fs_build_fault_attr(struct f2fs_sb_info *sbi, unsigned int rate,
							unsigned int type)
2016-05-16 12:38:50 +08:00
{
2018-03-08 14:22:56 +08:00
	struct f2fs_fault_info *ffi = &F2FS_OPTION(sbi).fault_info;
2016-09-23 21:30:09 +08:00
2016-05-16 12:38:50 +08:00
	if (rate) {
2016-09-23 21:30:09 +08:00
		atomic_set(&ffi->inject_ops, 0);
		ffi->inject_rate = rate;
2016-05-16 12:38:50 +08:00
	}
2018-08-08 17:36:41 +08:00
	if (type)
		ffi->inject_type = type;
	if (!rate && !type)
		memset(ffi, 0, sizeof(struct f2fs_fault_info));
2016-05-16 12:38:50 +08:00
}
2016-04-29 15:34:32 -07:00
#endif
2015-06-19 12:01:21 -07:00
/* f2fs-wide shrinker description */
static struct shrinker f2fs_shrinker_info = {
	.scan_objects = f2fs_shrink_scan,
	.count_objects = f2fs_shrink_count,
	.seeks = DEFAULT_SEEKS,
};
2012-11-02 17:07:47 +09:00
enum {
2013-06-16 09:48:48 +09:00
	Opt_gc_background,
2012-11-02 17:07:47 +09:00
	Opt_disable_roll_forward,
2015-01-23 18:33:46 -08:00
	Opt_norecovery,
2012-11-02 17:07:47 +09:00
	Opt_discard,
2016-07-03 22:05:14 +08:00
	Opt_nodiscard,
2012-11-02 17:07:47 +09:00
	Opt_noheap,
2017-03-24 20:41:45 -04:00
	Opt_heap,
2013-10-07 11:36:20 +09:00
	Opt_user_xattr,
2012-11-02 17:07:47 +09:00
	Opt_nouser_xattr,
2013-10-07 11:36:20 +09:00
	Opt_acl,
2012-11-02 17:07:47 +09:00
	Opt_noacl,
	Opt_active_logs,
	Opt_disable_ext_identify,
2013-08-08 15:16:22 +09:00
	Opt_inline_xattr,
2017-02-15 10:34:45 +08:00
	Opt_noinline_xattr,
f2fs: support flexible inline xattr size
Now, in product, more and more features based on file encryption were
introduced, their demand of xattr space is increasing, however, inline
xattr has fixed-size of 200 bytes, once inline xattr space is full, new
increased xattr data would occupy additional xattr block which may bring
us more space usage and performance regression during persisting.
In order to resolve above issue, it's better to expand inline xattr size
flexibly according to user's requirement.
So this patch introduces new filesystem feature 'flexible inline xattr',
and new mount option 'inline_xattr_size=%u', once mkfs enables the
feature, we can use the option to make f2fs supporting flexible inline
xattr size.
To support this feature, we add extra attribute i_inline_xattr_size in
inode layout, indicating how much space inline xattr borrows from
block address mapping space in inode layout, by this, we can easily
locate and store flexible-sized inline xattr data in inode.
Inode disk layout:
+----------------------+
| .i_mode |
| ... |
| .i_ext |
+----------------------+
| .i_extra_isize |
| .i_inline_xattr_size |-----------+
| ... | |
+----------------------+ |
| .i_addr | |
| - block address or | |
| - inline data | |
+----------------------+<---+ v
| inline xattr | +---inline xattr range
+----------------------+<---+
| .i_nid |
+----------------------+
| node_footer |
| (nid, ino, offset) |
+----------------------+
Note that, we have to consider backward compatibility which reserved
inline_data space, 200 bytes, all the time, reported by Sheng Yong.
Previous inline data or directory always reserved 200 bytes in inode layout,
even if inline_xattr is disabled. In order to keep inline_dentry's structure
for backward compatibility, we get the space back only from inline_data.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reported-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-06 21:59:50 +08:00
	Opt_inline_xattr_size,
2013-11-10 23:13:17 +08:00
	Opt_inline_data,
2014-09-24 18:16:13 +08:00
	Opt_inline_dentry,
2016-05-09 19:56:34 +08:00
	Opt_noinline_dentry,
2014-04-02 15:34:36 +09:00
	Opt_flush_merge,
2016-05-20 22:39:20 -07:00
	Opt_noflush_merge,
2022-10-25 01:54:01 +08:00
	Opt_barrier,
2014-07-23 09:57:31 -07:00
	Opt_nobarrier,
2014-10-30 22:47:03 -07:00
	Opt_fastboot,
2015-02-05 17:55:51 +08:00
	Opt_extent_cache,
2015-06-25 17:43:04 -07:00
	Opt_noextent_cache,
2015-03-24 10:20:27 +08:00
	Opt_noinline_data,
2015-12-16 13:12:16 +08:00
	Opt_data_flush,
2017-12-27 15:05:52 -08:00
	Opt_reserve_root,
2018-01-04 21:36:09 -08:00
	Opt_resgid,
	Opt_resuid,
2016-06-03 19:29:38 -07:00
	Opt_mode,
2016-12-21 17:09:19 -08:00
	Opt_io_size_bits,
2016-04-29 15:34:32 -07:00
	Opt_fault_injection,
2018-08-08 17:36:41 +08:00
	Opt_fault_type,
2016-05-20 21:47:24 -07:00
	Opt_lazytime,
	Opt_nolazytime,
2017-08-08 10:54:31 +08:00
	Opt_quota,
	Opt_noquota,
2017-07-09 00:13:07 +08:00
	Opt_usrquota,
	Opt_grpquota,
2017-07-26 00:01:41 +08:00
	Opt_prjquota,
2017-08-08 10:54:31 +08:00
	Opt_usrjquota,
	Opt_grpjquota,
	Opt_prjjquota,
	Opt_offusrjquota,
	Opt_offgrpjquota,
	Opt_offprjjquota,
	Opt_jqfmt_vfsold,
	Opt_jqfmt_vfsv0,
	Opt_jqfmt_vfsv1,
2018-02-18 08:50:49 -08:00
	Opt_alloc,
2018-03-07 12:07:49 +08:00
	Opt_fsync,
2018-03-15 18:51:42 +08:00
	Opt_test_dummy_encryption,
2020-07-02 01:56:06 +00:00
	Opt_inlinecrypt,
2019-05-29 17:49:06 -07:00
	Opt_checkpoint_disable,
	Opt_checkpoint_disable_cap,
	Opt_checkpoint_disable_cap_perc,
	Opt_checkpoint_enable,
f2fs: introduce checkpoint_merge mount option
We've added new mount options, "checkpoint_merge" and "nocheckpoint_merge",
which create a kernel daemon and make it merge concurrent checkpoint
requests as much as possible to eliminate redundant checkpoint issues. Plus,
we can eliminate the sluggish issue caused by slow checkpoint operation
when the checkpoint is done in a process context in a cgroup having
low i/o budget and cpu shares. To make this do better, we set the
default i/o priority of the kernel daemon to "3", to give one higher
priority than other kernel threads. The below verification result
explains this.
The basic idea has come from https://opensource.samsung.com.
[Verification]
Android Pixel Device(ARM64, 7GB RAM, 256GB UFS)
Create two I/O cgroups (fg w/ weight 100, bg w/ weight 20)
Set "strict_guarantees" to "1" in BFQ tunables
In "fg" cgroup,
- thread A => trigger 1000 checkpoint operations
"for i in `seq 1 1000`; do touch test_dir1/file; fsync test_dir1;
done"
- thread B => generating async. I/O
"fio --rw=write --numjobs=1 --bs=128k --runtime=3600 --time_based=1
--filename=test_img --name=test"
In "bg" cgroup,
- thread C => trigger repeated checkpoint operations
"echo $$ > /dev/blkio/bg/tasks; while true; do touch test_dir2/file;
fsync test_dir2; done"
We've measured thread A's execution time.
[ w/o patch ]
Elapsed Time: Avg. 68 seconds
[ w/ patch ]
Elapsed Time: Avg. 48 seconds
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
[Jaegeuk Kim: fix the return value in f2fs_start_ckpt_thread, reported by Dan]
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-01-19 09:00:42 +09:00
	Opt_checkpoint_merge,
	Opt_nocheckpoint_merge,
f2fs: support data compression
This patch tries to support compression in f2fs.
- New term named cluster is defined as basic unit of compression, file can
be divided into multiple clusters logically. One cluster includes 4 << n
(n >= 0) logical pages, compression size is also cluster size, each of
cluster can be compressed or not.
- In cluster metadata layout, one special flag is used to indicate cluster
is compressed one or normal one, for compressed cluster, following metadata
maps cluster to [1, (4 << n) - 1] physical blocks, in where f2fs stores
data including compress header and compressed data.
- In order to eliminate write amplification during overwrite, F2FS only
supports compression on write-once file, data can be compressed only when
all logical blocks in file are valid and cluster compress ratio is lower
than specified threshold.
- To enable compression on regular inode, there are three ways:
* chattr +c file
* chattr +c dir; touch dir/file
* mount w/ -o compress_extension=ext; touch file.ext
Compress metadata layout:
[Dnode Structure]
+-----------------------------------------------+
| cluster 1 | cluster 2 | ......... | cluster N |
+-----------------------------------------------+
. . . .
. . . .
. Compressed Cluster . . Normal Cluster .
+----------+---------+---------+---------+ +---------+---------+---------+---------+
|compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 |
+----------+---------+---------+---------+ +---------+---------+---------+---------+
. .
. .
. .
+-------------+-------------+----------+----------------------------+
| data length | data chksum | reserved | compressed data |
+-------------+-------------+----------+----------------------------+
Changelog:
20190326:
- fix error handling of read_end_io().
- remove unneeded comments in f2fs_encrypt_one_page().
20190327:
- fix wrong use of f2fs_cluster_is_full() in f2fs_mpage_readpages().
- don't jump into loop directly to avoid uninitialized variables.
- add TODO tag in error path of f2fs_write_cache_pages().
20190328:
- fix wrong merge condition in f2fs_read_multi_pages().
- check compressed file in f2fs_post_read_required().
20190401
- allow overwrite on non-compressed cluster.
- check cluster meta before writing compressed data.
20190402
- don't preallocate blocks for compressed file.
- add lz4 compress algorithm
- process multiple post read works in one workqueue
Now f2fs supports processing post read work in multiple workqueue,
it shows low performance due to schedule overhead of multiple
workqueue executing orderly.
20190921
- compress: support buffered overwrite
C: compress cluster flag
V: valid block address
N: NEW_ADDR
One cluster contains 4 blocks
before overwrite after overwrite
- VVVV -> CVNN
- CVNN -> VVVV
- CVNN -> CVNN
- CVNN -> CVVV
- CVVV -> CVNN
- CVVV -> CVVV
20191029
- add kconfig F2FS_FS_COMPRESSION to isolate compression related
codes, add kconfig F2FS_FS_{LZO,LZ4} to cover backend algorithm.
note that: will remove lzo backend if Jaegeuk agreed that too.
- update codes according to Eric's comments.
20191101
- apply fixes from Jaegeuk
20191113
- apply fixes from Jaegeuk
- split workqueue for fsverity
20191216
- apply fixes from Jaegeuk
20200117
- fix to avoid NULL pointer dereference
[Jaegeuk Kim]
- add tracepoint for f2fs_{,de}compress_pages()
- fix many bugs and add some compression stats
- fix overwrite/mmap bugs
- address 32bit build error, reported by Geert.
- bug fixes when handling errors and i_compressed_blocks
Reported-by: <noreply@ellerman.id.au>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
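The cluster arithmetic in this message can be written out as a few helpers. This is a minimal sketch under the message's own definitions (a cluster is 4 << n logical pages, and a compressed cluster occupies between 1 and (4 << n) - 1 physical blocks); the function names are hypothetical and do not come from the f2fs sources.

```c
#include <assert.h>

/* Pages per cluster for log size n: 4 << n, as the commit message defines it. */
unsigned int cluster_pages(unsigned int n)
{
	assert(n < 16);	/* arbitrary sketch limit that keeps the shift well defined */
	return 4u << n;
}

/* Index of the cluster that contains logical page number `page`. */
unsigned long cluster_index(unsigned long page, unsigned int n)
{
	return page / cluster_pages(n);
}

/*
 * Upper bound of the [1, (4 << n) - 1] range of physical blocks a
 * compressed cluster may occupy; one slot is taken by the compress flag.
 */
unsigned int max_compressed_blocks(unsigned int n)
{
	return cluster_pages(n) - 1;
}
```

With n = 0 a cluster is four pages, which matches the four-slot "compr flag, block 1, block 2, block 3" layout in the diagram above.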
2019-11-01 18:07:14 +08:00
	Opt_compress_algorithm,
	Opt_compress_log_size,
	Opt_compress_extension,
2021-06-08 19:15:08 +08:00
	Opt_nocompress_extension,
2020-11-26 18:32:09 +08:00
	Opt_compress_chksum,
2020-12-01 13:08:02 +09:00
	Opt_compress_mode,
2021-05-20 19:51:50 +08:00
	Opt_compress_cache,
f2fs: support age threshold based garbage collection
There are several issues in current background GC algorithm:
- the number of valid blocks is one of the key factors in the cost
calculation, so if a segment has few valid blocks, the CB algorithm
will still choose it as the victim even if it is young or is a hot
segment, which is not appropriate.
- GCed data/node will go to existing logs, regardless of whether the
data already there has the same update frequency, so hot and cold
data may be mixed again.
- the GC allocator mainly uses LFS type segments, which consumes free
segments more quickly.
This patch introduces a new algorithm named age threshold based
garbage collection to solve the above issues. There are three main
steps:
1. select a source victim:
- set an age threshold, and select candidates based on the threshold:
e.g.
0 means youngest, 100 means oldest, if we set age threshold to 80
then select dirty segments which have an age in the range [80, 100]
as candidates;
- set candidate_ratio threshold, and select candidates based on the
ratio, so that we can shrink candidates to those oldest segments;
- select target segment with fewest valid blocks in order to
migrate blocks with minimum cost;
2. select a target victim:
- select candidates based on age threshold;
- set candidate_radius threshold, search candidates whose age is
around source victims, searching radius should be less than the
radius threshold.
- select target segment with most valid blocks in order to avoid
migrating current target segment.
3. merge valid blocks from source victim into target victim with
SSR allocator.
Test steps:
- create 160 dirty segments:
* half of them have 128 valid blocks per segment
* left of them have 384 valid blocks per segment
- run background GC
Benefit: GC count and block movement count both decrease obviously:
- Before:
- Valid: 86
- Dirty: 1
- Prefree: 11
- Free: 6001 (6001)
GC calls: 162 (BG: 220)
- data segments : 160 (160)
- node segments : 2 (2)
Try to move 41454 blocks (BG: 41454)
- data blocks : 40960 (40960)
- node blocks : 494 (494)
IPU: 0 blocks
SSR: 0 blocks in 0 segments
LFS: 41364 blocks in 81 segments
- After:
- Valid: 87
- Dirty: 0
- Prefree: 4
- Free: 6008 (6008)
GC calls: 75 (BG: 76)
- data segments : 74 (74)
- node segments : 1 (1)
Try to move 12813 blocks (BG: 12813)
- data blocks : 12544 (12544)
- node blocks : 269 (269)
IPU: 0 blocks
SSR: 12032 blocks in 77 segments
LFS: 855 blocks in 2 segments
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix a bug along with pinfile in-mem segment & clean up]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
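Step 1 of the algorithm above (select a source victim) can be sketched in C as follows. This is an illustration under simplifying assumptions, not the f2fs implementation: `struct sketch_segment` and `pick_source_victim()` are hypothetical names, ages are already normalized to the 0..100 scale used in the message, and the candidate_ratio narrowing step is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-segment summary used only by this sketch. */
struct sketch_segment {
	unsigned int age;		/* 0 means youngest, 100 means oldest */
	unsigned int valid_blocks;	/* blocks that would have to be migrated */
};

/*
 * Return the index of the dirty segment whose age is at or above the
 * threshold and that has the fewest valid blocks, or -1 if no segment
 * is old enough. On a tie the earlier segment wins.
 */
int pick_source_victim(const struct sketch_segment *segs, size_t count,
		       unsigned int age_threshold)
{
	int best = -1;
	size_t i;

	assert(age_threshold <= 100);

	for (i = 0; i < count; i++) {
		/* Too young: not a candidate at all. */
		if (segs[i].age < age_threshold)
			continue;
		/* Prefer the candidate that is cheapest to migrate. */
		if (best < 0 || segs[i].valid_blocks < segs[best].valid_blocks)
			best = (int)i;
	}
	return best;
}
```

Selecting a target victim (step 2) would follow the same shape with the comparison reversed, since the message asks for the candidate with the most valid blocks there.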
2020-08-04 21:14:49 +08:00
	Opt_atgc,
2021-03-27 17:57:06 +08:00
	Opt_gc_merge,
	Opt_nogc_merge,
f2fs: introduce discard_unit mount option
As James Z reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=213877
[1.] One-line summary of the problem:
Mount multiple SMR block devices exceed certain number cause system non-response
[2.] Full description of the problem/report:
Created some F2FS on SMR devices (mkfs.f2fs -m), then mounted in sequence. Each device is the same Model: HGST HSH721414AL (Size 14TB).
Empirically, found that when the amount of SMR device * 1.5Gb > System RAM, the system ran out of memory and hung. No dmesg output. For example, 24 SMR Disk need 24*1.5GB = 36GB. A system with 32G RAM can only mount 21 devices, the 22nd device will be a reproducible cause of system hang.
The number of SMR devices with other FS mounted on this system does not interfere with the result above.
[3.] Keywords (i.e., modules, networking, kernel):
F2FS, SMR, Memory
[4.] Kernel information
[4.1.] Kernel version (uname -a):
Linux 5.13.4-200.fc34.x86_64 #1 SMP Tue Jul 20 20:27:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[4.2.] Kernel .config file:
Default Fedora 34 with f2fs-tools-1.14.0-2.fc34.x86_64
[5.] Most recent kernel version which did not have the bug:
None
[6.] Output of Oops.. message (if applicable) with symbolic information
resolved (see Documentation/admin-guide/oops-tracing.rst)
None
[7.] A small shell script or example program which triggers the
problem (if possible)
mount /dev/sdX /mnt/0X
[8.] Memory consumption
With 24 * 14T SMR Block device with F2FS
free -g
total used free shared buff/cache available
Mem: 46 36 0 0 10 10
Swap: 0 0 0
With 3 * 14T SMR Block device with F2FS
free -g
total used free shared buff/cache available
Mem: 7 5 0 0 1 1
Swap: 7 0 7
The root cause is, there are three bitmaps:
- cur_valid_map
- ckpt_valid_map
- discard_map
and each of them will cost ~500MB memory, {cur, ckpt}_valid_map are
necessary, but discard_map is optional, since this bitmap is only
useful on mount points where small discard is enabled.
For a blkzoned device such as SMR or ZNS devices, f2fs will only issue
discard for a section(zone) when all blocks of that section are invalid,
so, for such device, we don't need small discard functionality at all.
This patch introduces a new mount option "discard_unit=block|segment|
section" to support issuing discard with different basic unit which is
aligned to block, segment or section, so that user can specify
"discard_unit=segment" or "discard_unit=section" to disable small
discard functionality.
Note that this mount option cannot be changed by remount() because the
related metadata needs to be initialized during mount().
In order to save memory, let's use "discard_unit=section" for blkzoned
device by default.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
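The "~500MB per bitmap" figure in this message can be checked with a short calculation. This is a rough sketch that assumes one bit per 4 KiB block and treats the 14TB disk from the report as exactly 14 TiB; `bitmap_bytes()` is a hypothetical helper, not an f2fs function.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for a bitmap that holds one bit per block of the device. */
uint64_t bitmap_bytes(uint64_t device_bytes, uint32_t block_size)
{
	uint64_t blocks;

	assert(block_size != 0);
	blocks = device_bytes / block_size;
	return (blocks + 7) / 8;	/* round up to whole bytes */
}
```

Under those assumptions, bitmap_bytes(14ULL << 40, 4096) is 469762048 bytes, about 448 MiB, and three such bitmaps come to roughly 1.3 GiB per device. That is in line with the ~1.5GB per mounted disk seen in the report, and dropping discard_map saves about a third of it.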
2021-08-03 08:15:43 +08:00
	Opt_discard_unit,
2022-06-20 10:38:42 -07:00
	Opt_memory_mode,
2022-12-01 17:37:15 -08:00
	Opt_age_extent_cache,
2012-11-02 17:07:47 +09:00
	Opt_err,
};
static match_table_t f2fs_tokens = {
2013-06-16 09:48:48 +09:00
	{Opt_gc_background, "background_gc=%s"},
2012-11-02 17:07:47 +09:00
	{Opt_disable_roll_forward, "disable_roll_forward"},
2015-01-23 18:33:46 -08:00
	{Opt_norecovery, "norecovery"},
2012-11-02 17:07:47 +09:00
	{Opt_discard, "discard"},
2016-07-03 22:05:14 +08:00
	{Opt_nodiscard, "nodiscard"},
2012-11-02 17:07:47 +09:00
	{Opt_noheap, "no_heap"},
2017-03-24 20:41:45 -04:00
	{Opt_heap, "heap"},
2013-10-07 11:36:20 +09:00
	{Opt_user_xattr, "user_xattr"},
2012-11-02 17:07:47 +09:00
	{Opt_nouser_xattr, "nouser_xattr"},
2013-10-07 11:36:20 +09:00
	{Opt_acl, "acl"},
2012-11-02 17:07:47 +09:00
	{Opt_noacl, "noacl"},
	{Opt_active_logs, "active_logs=%u"},
	{Opt_disable_ext_identify, "disable_ext_identify"},
2013-08-08 15:16:22 +09:00
	{Opt_inline_xattr, "inline_xattr"},
2017-02-15 10:34:45 +08:00
	{Opt_noinline_xattr, "noinline_xattr"},
2017-09-06 21:59:50 +08:00
	{Opt_inline_xattr_size, "inline_xattr_size=%u"},
2013-11-10 23:13:17 +08:00
	{Opt_inline_data, "inline_data"},
2014-09-24 18:16:13 +08:00
	{Opt_inline_dentry, "inline_dentry"},
2016-05-09 19:56:34 +08:00
	{Opt_noinline_dentry, "noinline_dentry"},
2014-04-02 15:34:36 +09:00
	{Opt_flush_merge, "flush_merge"},
2016-05-20 22:39:20 -07:00
	{Opt_noflush_merge, "noflush_merge"},
2022-10-25 01:54:01 +08:00
	{Opt_barrier, "barrier"},
2014-07-23 09:57:31 -07:00
	{Opt_nobarrier, "nobarrier"},
2014-10-30 22:47:03 -07:00
	{Opt_fastboot, "fastboot"},
2015-02-05 17:55:51 +08:00
	{Opt_extent_cache, "extent_cache"},
2015-06-25 17:43:04 -07:00
	{Opt_noextent_cache, "noextent_cache"},
2015-03-24 10:20:27 +08:00
	{Opt_noinline_data, "noinline_data"},
2015-12-16 13:12:16 +08:00
	{Opt_data_flush, "data_flush"},
2017-12-27 15:05:52 -08:00
	{Opt_reserve_root, "reserve_root=%u"},
2018-01-04 21:36:09 -08:00
	{Opt_resgid, "resgid=%u"},
	{Opt_resuid, "resuid=%u"},
2016-06-03 19:29:38 -07:00
	{Opt_mode, "mode=%s"},
2016-12-21 17:09:19 -08:00
	{Opt_io_size_bits, "io_bits=%u"},
2016-04-29 15:34:32 -07:00
	{Opt_fault_injection, "fault_injection=%u"},
2018-08-08 17:36:41 +08:00
	{Opt_fault_type, "fault_type=%u"},
2016-05-20 21:47:24 -07:00
	{Opt_lazytime, "lazytime"},
	{Opt_nolazytime, "nolazytime"},
2017-08-08 10:54:31 +08:00
	{Opt_quota, "quota"},
	{Opt_noquota, "noquota"},
2017-07-09 00:13:07 +08:00
	{Opt_usrquota, "usrquota"},
	{Opt_grpquota, "grpquota"},
2017-07-26 00:01:41 +08:00
	{Opt_prjquota, "prjquota"},
2017-08-08 10:54:31 +08:00
	{Opt_usrjquota, "usrjquota=%s"},
	{Opt_grpjquota, "grpjquota=%s"},
	{Opt_prjjquota, "prjjquota=%s"},
	{Opt_offusrjquota, "usrjquota="},
	{Opt_offgrpjquota, "grpjquota="},
	{Opt_offprjjquota, "prjjquota="},
	{Opt_jqfmt_vfsold, "jqfmt=vfsold"},
	{Opt_jqfmt_vfsv0, "jqfmt=vfsv0"},
	{Opt_jqfmt_vfsv1, "jqfmt=vfsv1"},
2018-02-18 08:50:49 -08:00
	{Opt_alloc, "alloc_mode=%s"},
2018-03-07 12:07:49 +08:00
	{Opt_fsync, "fsync_mode=%s"},
fscrypt: support test_dummy_encryption=v2
v1 encryption policies are deprecated in favor of v2, and some new
features (e.g. encryption+casefolding) are only being added for v2.
Therefore, the "test_dummy_encryption" mount option (which is used for
encryption I/O testing with xfstests) needs to support v2 policies.
To do this, extend its syntax to be "test_dummy_encryption=v1" or
"test_dummy_encryption=v2". The existing "test_dummy_encryption" (no
argument) also continues to be accepted, to specify the default setting
-- currently v1, but the next patch changes it to v2.
To cleanly support both v1 and v2 while also making it easy to support
specifying other encryption settings in the future (say, accepting
"$contents_mode:$filenames_mode:v2"), make ext4 and f2fs maintain a
pointer to the dummy fscrypt_context rather than using mount flags.
To avoid concurrency issues, don't allow test_dummy_encryption to be set
or changed during a remount. (The former restriction is new, but
xfstests doesn't run into it, so no one should notice.)
Tested with 'gce-xfstests -c {ext4,f2fs}/encrypt -g auto'. On ext4,
there are two regressions, both of which are test bugs: ext4/023 and
ext4/028 fail because they set an xattr and expect it to be stored
inline, but the increase in size of the fscrypt_context from
24 to 40 bytes causes this xattr to be spilled into an external block.
Link: https://lore.kernel.org/r/20200512233251.118314-4-ebiggers@kernel.org
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-12 16:32:50 -07:00
	{Opt_test_dummy_encryption, "test_dummy_encryption=%s"},
2018-03-15 18:51:42 +08:00
	{Opt_test_dummy_encryption, "test_dummy_encryption"},
2020-07-02 01:56:06 +00:00
	{Opt_inlinecrypt, "inlinecrypt"},
2019-05-29 17:49:06 -07:00
	{Opt_checkpoint_disable, "checkpoint=disable"},
	{Opt_checkpoint_disable_cap, "checkpoint=disable:%u"},
	{Opt_checkpoint_disable_cap_perc, "checkpoint=disable:%u%%"},
	{Opt_checkpoint_enable, "checkpoint=enable"},
2021-01-19 09:00:42 +09:00
	{Opt_checkpoint_merge, "checkpoint_merge"},
	{Opt_nocheckpoint_merge, "nocheckpoint_merge"},
f2fs: support data compression
This patch tries to support compression in f2fs.
- New term named cluster is defined as basic unit of compression, file can
be divided into multiple clusters logically. One cluster includes 4 << n
(n >= 0) logical pages, compression size is also cluster size, each of
cluster can be compressed or not.
- In cluster metadata layout, one special flag is used to indicate cluster
is compressed one or normal one, for compressed cluster, following metadata
maps cluster to [1, 4 << n - 1] physical blocks, in where f2fs stores
data including compress header and compressed data.
- In order to eliminate write amplification during overwrite, F2FS only
support compression on write-once file, data can be compressed only when
all logical blocks in file are valid and cluster compress ratio is lower
than specified threshold.
- To enable compression on regular inode, there are three ways:
* chattr +c file
* chattr +c dir; touch dir/file
* mount w/ -o compress_extension=ext; touch file.ext
Compress metadata layout:
[Dnode Structure]
+-----------------------------------------------+
| cluster 1 | cluster 2 | ......... | cluster N |
+-----------------------------------------------+
. . . .
. . . .
. Compressed Cluster . . Normal Cluster .
+----------+---------+---------+---------+ +---------+---------+---------+---------+
|compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 |
+----------+---------+---------+---------+ +---------+---------+---------+---------+
. .
. .
. .
+-------------+-------------+----------+----------------------------+
| data length | data chksum | reserved | compressed data |
+-------------+-------------+----------+----------------------------+
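The cluster geometry described above can be sketched in a few lines of C. This is only an illustration: the helper names cluster_pages(), max_compressed_blocks() and cluster_index() are hypothetical and do not appear in f2fs, and n is assumed to be the cluster log-size exponent from the text (4 << n pages per cluster).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: logical pages in one cluster, i.e. 4 << n, n >= 0. */
static inline uint32_t cluster_pages(unsigned int n)
{
	assert(n < 30);	/* keep 4 << n within 32 bits */
	return UINT32_C(4) << n;
}

/*
 * Hypothetical helper: the largest number of physical blocks a compressed
 * cluster may map, i.e. the upper bound of [1, (4 << n) - 1].
 */
static inline uint32_t max_compressed_blocks(unsigned int n)
{
	return cluster_pages(n) - 1;
}

/* Hypothetical helper: index of the cluster that holds a logical page. */
static inline uint64_t cluster_index(uint64_t page_idx, unsigned int n)
{
	return page_idx / cluster_pages(n);
}
```

With n = 0 a cluster covers 4 pages and a compressed cluster occupies at most 3 physical blocks, which matches the "compr flag + block 1..3" slot layout drawn above.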
Changelog:
20190326:
- fix error handling of read_end_io().
- remove unneeded comments in f2fs_encrypt_one_page().
20190327:
- fix wrong use of f2fs_cluster_is_full() in f2fs_mpage_readpages().
- don't jump into loop directly to avoid uninitialized variables.
- add TODO tag in error path of f2fs_write_cache_pages().
20190328:
- fix wrong merge condition in f2fs_read_multi_pages().
- check compressed file in f2fs_post_read_required().
20190401
- allow overwrite on non-compressed cluster.
- check cluster meta before writing compressed data.
20190402
- don't preallocate blocks for compressed file.
- add lz4 compress algorithm
- process multiple post read works in one workqueue
Now f2fs supports processing post read work in multiple workqueues, but
it shows low performance due to the scheduling overhead of multiple
workqueues executing in order.
20190921
- compress: support buffered overwrite
C: compress cluster flag
V: valid block address
N: NEW_ADDR
One cluster contains 4 blocks
before overwrite after overwrite
- VVVV -> CVNN
- CVNN -> VVVV
- CVNN -> CVNN
- CVNN -> CVVV
- CVVV -> CVNN
- CVVV -> CVVV
20191029
- add kconfig F2FS_FS_COMPRESSION to isolate compression related
codes, add kconfig F2FS_FS_{LZO,LZ4} to cover backend algorithm.
note that: will remove lzo backend if Jaegeuk agreed that too.
- update codes according to Eric's comments.
20191101
- apply fixes from Jaegeuk
20191113
- apply fixes from Jaegeuk
- split workqueue for fsverity
20191216
- apply fixes from Jaegeuk
20200117
- fix to avoid NULL pointer dereference
[Jaegeuk Kim]
- add tracepoint for f2fs_{,de}compress_pages()
- fix many bugs and add some compression stats
- fix overwrite/mmap bugs
- address 32bit build error, reported by Geert.
- bug fixes when handling errors and i_compressed_blocks
Reported-by: <noreply@ellerman.id.au>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-11-01 18:07:14 +08:00
{ Opt_compress_algorithm , " compress_algorithm=%s " } ,
{ Opt_compress_log_size , " compress_log_size=%u " } ,
{ Opt_compress_extension , " compress_extension=%s " } ,
2021-06-08 19:15:08 +08:00
{ Opt_nocompress_extension , " nocompress_extension=%s " } ,
2020-11-26 18:32:09 +08:00
{ Opt_compress_chksum , " compress_chksum " } ,
2020-12-01 13:08:02 +09:00
{ Opt_compress_mode , " compress_mode=%s " } ,
2021-05-20 19:51:50 +08:00
{ Opt_compress_cache , " compress_cache " } ,
f2fs: support age threshold based garbage collection
There are several issues in current background GC algorithm:
- The valid block count is one of the key factors in the cost calculation,
so if a segment has few valid blocks, the CB algorithm will still choose it
as the victim even if its age is young or it is a hot segment, which is
not appropriate.
- GCed data/node blocks go to existing logs regardless of whether the data
already there has the same update frequency, so hot and cold data may be
mixed again.
- The GC allocator mainly uses LFS type segments, so it consumes free
segments more quickly.
This patch introduces a new algorithm named age threshold based
garbage collection to solve above issues, there are three steps
mainly:
1. select a source victim:
- set an age threshold, and select candidates based on the threshold:
e.g.
0 means youngest, 100 means oldest, if we set age threshold to 80
then select dirty segments which have an age in the range [80, 100] as
candidates;
- set candidate_ratio threshold, and select candidates based on the
ratio, so that we can shrink candidates to those oldest segments;
- select target segment with fewest valid blocks in order to
migrate blocks with minimum cost;
2. select a target victim:
- select candidates based on the age threshold;
- set candidate_radius threshold, search candidates whose age is
around the source victim's; the searching radius should be less than the
radius threshold.
- select target segment with most valid blocks in order to avoid
migrating current target segment.
3. merge valid blocks from source victim into target victim with
SSR allocator.
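Step 1 above (source victim selection) can be sketched roughly as follows. This is a simplified illustration, not the kernel's implementation: struct seg_info and pick_source_victim() are hypothetical names, the candidate_ratio shrinking step is left out, and ages are assumed to be already normalized to the 0..100 scale used in the example.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative per-segment summary; not the kernel's data structure. */
struct seg_info {
	unsigned int age;		/* 0 = youngest, 100 = oldest */
	unsigned int valid_blocks;	/* valid blocks left in the segment */
};

/*
 * Return the index of the dirty segment whose age is at least
 * age_threshold and which has the fewest valid blocks, or -1 if no
 * segment qualifies. Ties keep the first candidate found.
 */
static int pick_source_victim(const struct seg_info *segs, size_t nr,
			      unsigned int age_threshold)
{
	int best = -1;
	size_t i;

	assert(segs || nr == 0);
	for (i = 0; i < nr; i++) {
		if (segs[i].age < age_threshold)
			continue;	/* too young to be a candidate */
		if (best < 0 ||
		    segs[i].valid_blocks < segs[best].valid_blocks)
			best = (int)i;	/* cheaper to migrate */
	}
	return best;
}
```

Picking the candidate with the fewest valid blocks keeps the migration cost low, while the age filter avoids choosing young or hot segments only because they happen to be sparse.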
Test steps:
- create 160 dirty segments:
* half of them have 128 valid blocks per segment
* left of them have 384 valid blocks per segment
- run background GC
Benefit: GC count and block movement count both decrease obviously:
- Before:
- Valid: 86
- Dirty: 1
- Prefree: 11
- Free: 6001 (6001)
GC calls: 162 (BG: 220)
- data segments : 160 (160)
- node segments : 2 (2)
Try to move 41454 blocks (BG: 41454)
- data blocks : 40960 (40960)
- node blocks : 494 (494)
IPU: 0 blocks
SSR: 0 blocks in 0 segments
LFS: 41364 blocks in 81 segments
- After:
- Valid: 87
- Dirty: 0
- Prefree: 4
- Free: 6008 (6008)
GC calls: 75 (BG: 76)
- data segments : 74 (74)
- node segments : 1 (1)
Try to move 12813 blocks (BG: 12813)
- data blocks : 12544 (12544)
- node blocks : 269 (269)
IPU: 0 blocks
SSR: 12032 blocks in 77 segments
LFS: 855 blocks in 2 segments
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix a bug along with pinfile in-mem segment & clean up]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-08-04 21:14:49 +08:00
{ Opt_atgc , " atgc " } ,
2021-03-27 17:57:06 +08:00
{ Opt_gc_merge , " gc_merge " } ,
{ Opt_nogc_merge , " nogc_merge " } ,
f2fs: introduce discard_unit mount option
As James Z reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=213877
[1.] One-line summary of the problem:
Mount multiple SMR block devices exceed certain number cause system non-response
[2.] Full description of the problem/report:
Created some F2FS on SMR devices (mkfs.f2fs -m), then mounted in sequence. Each device is the same Model: HGST HSH721414AL (Size 14TB).
Empirically, found that when the amount of SMR device * 1.5Gb > System RAM, the system ran out of memory and hung. No dmesg output. For example, 24 SMR Disk need 24*1.5GB = 36GB. A system with 32G RAM can only mount 21 devices, the 22nd device will be a reproducible cause of system hang.
The number of SMR devices with other FS mounted on this system does not interfere with the result above.
[3.] Keywords (i.e., modules, networking, kernel):
F2FS, SMR, Memory
[4.] Kernel information
[4.1.] Kernel version (uname -a):
Linux 5.13.4-200.fc34.x86_64 #1 SMP Tue Jul 20 20:27:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[4.2.] Kernel .config file:
Default Fedora 34 with f2fs-tools-1.14.0-2.fc34.x86_64
[5.] Most recent kernel version which did not have the bug:
None
[6.] Output of Oops.. message (if applicable) with symbolic information
resolved (see Documentation/admin-guide/oops-tracing.rst)
None
[7.] A small shell script or example program which triggers the
problem (if possible)
mount /dev/sdX /mnt/0X
[8.] Memory consumption
With 24 * 14T SMR Block device with F2FS
free -g
total used free shared buff/cache available
Mem: 46 36 0 0 10 10
Swap: 0 0 0
With 3 * 14T SMR Block device with F2FS
free -g
total used free shared buff/cache available
Mem: 7 5 0 0 1 1
Swap: 7 0 7
The root cause is, there are three bitmaps:
- cur_valid_map
- ckpt_valid_map
- discard_map
and each of them costs ~500MB of memory. {cur, ckpt}_valid_map are
necessary, but discard_map is optional, since this bitmap is only
useful on a mountpoint where small discard is enabled.
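As a rough cross-check of the per-bitmap cost, the sketch below estimates the size of a bitmap that tracks one bit per block. The helper block_bitmap_bytes() is hypothetical, and both the one-bit-per-block layout and the 4 KiB block size are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes needed by a bitmap holding one bit per block (assumption),
 * rounded up to a whole byte.
 */
static inline uint64_t block_bitmap_bytes(uint64_t dev_bytes,
					  uint32_t block_size)
{
	uint64_t blocks;

	assert(block_size != 0);
	blocks = dev_bytes / block_size;
	return (blocks + 7) / 8;
}
```

For a 14 TB device (14 * 10^12 bytes) with 4 KiB blocks this gives 427246094 bytes, a little over 400 MB per bitmap, which is the same order of magnitude as the figure quoted above. Three such bitmaps come to roughly 1.3 GB per device, in line with the reporter's observation of about 1.5 GB per mounted disk.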
For a blkzoned device such as SMR or ZNS devices, f2fs will only issue
discard for a section(zone) when all blocks of that section are invalid,
so, for such device, we don't need small discard functionality at all.
This patch introduces a new mount option "discard_unit=block|segment|
section" to support issuing discard with a different basic unit, aligned
to block, segment or section, so that the user can specify
"discard_unit=segment" or "discard_unit=section" to disable small
discard functionality.
Note that this mount option cannot be changed by remount() because the
related metadata needs to be initialized during mount().
In order to save memory, let's use "discard_unit=section" for blkzoned
devices by default.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-03 08:15:43 +08:00
{ Opt_discard_unit , " discard_unit=%s " } ,
2022-06-20 10:38:42 -07:00
{ Opt_memory_mode , " memory=%s " } ,
2022-12-01 17:37:15 -08:00
{ Opt_age_extent_cache , " age_extent_cache " } ,
2012-11-02 17:07:47 +09:00
{ Opt_err , NULL } ,
} ;
2019-06-18 17:48:42 +08:00
void f2fs_printk ( struct f2fs_sb_info * sbi , const char * fmt , . . . )
2012-12-30 14:52:05 +09:00
{
struct va_format vaf ;
va_list args ;
2019-06-18 17:48:42 +08:00
int level ;
2012-12-30 14:52:05 +09:00
va_start ( args , fmt ) ;
2019-06-18 17:48:42 +08:00
level = printk_get_level ( fmt ) ;
vaf . fmt = printk_skip_level ( fmt ) ;
2012-12-30 14:52:05 +09:00
vaf . va = & args ;
2019-06-18 17:48:42 +08:00
printk ( " %c%cF2FS-fs (%s): %pV \n " ,
KERN_SOH_ASCII , level , sbi - > sb - > s_id , & vaf ) ;
2012-12-30 14:52:05 +09:00
va_end ( args ) ;
}
2022-01-18 07:56:14 +01:00
# if IS_ENABLED(CONFIG_UNICODE)
2019-07-23 16:05:28 -07:00
static const struct f2fs_sb_encodings {
__u16 magic ;
char * name ;
2021-09-15 09:00:00 +02:00
unsigned int version ;
2019-07-23 16:05:28 -07:00
} f2fs_sb_encoding_map [ ] = {
2021-09-15 09:00:00 +02:00
{ F2FS_ENC_UTF8_12_1 , " utf8 " , UNICODE_AGE ( 12 , 1 , 0 ) } ,
2019-07-23 16:05:28 -07:00
} ;
2021-09-15 08:59:57 +02:00
static const struct f2fs_sb_encodings *
f2fs_sb_read_encoding ( const struct f2fs_super_block * sb )
2019-07-23 16:05:28 -07:00
{
__u16 magic = le16_to_cpu ( sb - > s_encoding ) ;
int i ;
for ( i = 0 ; i < ARRAY_SIZE ( f2fs_sb_encoding_map ) ; i + + )
if ( magic = = f2fs_sb_encoding_map [ i ] . magic )
2021-09-15 08:59:57 +02:00
return & f2fs_sb_encoding_map [ i ] ;
2019-07-23 16:05:28 -07:00
2021-09-15 08:59:57 +02:00
return NULL ;
2019-07-23 16:05:28 -07:00
}
2021-06-11 07:46:30 +08:00
struct kmem_cache * f2fs_cf_name_slab ;
static int __init f2fs_create_casefold_cache ( void )
{
f2fs_cf_name_slab = f2fs_kmem_cache_create ( " f2fs_casefolded_name " ,
F2FS_NAME_LEN ) ;
2022-11-25 19:47:36 +08:00
return f2fs_cf_name_slab ? 0 : - ENOMEM ;
2021-06-11 07:46:30 +08:00
}
static void f2fs_destroy_casefold_cache ( void )
{
kmem_cache_destroy ( f2fs_cf_name_slab ) ;
}
# else
static int __init f2fs_create_casefold_cache ( void ) { return 0 ; }
static void f2fs_destroy_casefold_cache ( void ) { }
2019-07-23 16:05:28 -07:00
# endif
2017-12-27 15:05:52 -08:00
static inline void limit_reserve_root ( struct f2fs_sb_info * sbi )
{
2022-08-23 10:18:42 -07:00
block_t limit = min ( ( sbi - > user_block_count > > 3 ) ,
2019-05-29 17:49:04 -07:00
sbi - > user_block_count - sbi - > reserved_blocks ) ;
2017-12-27 15:05:52 -08:00
2022-08-23 10:18:42 -07:00
/* limit is 12.5% */
2018-03-08 14:22:56 +08:00
if ( test_opt ( sbi , RESERVE_ROOT ) & &
F2FS_OPTION ( sbi ) . root_reserved_blocks > limit ) {
F2FS_OPTION ( sbi ) . root_reserved_blocks = limit ;
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Reduce reserved blocks for root = %u " ,
F2FS_OPTION ( sbi ) . root_reserved_blocks ) ;
2017-12-27 15:05:52 -08:00
}
2018-01-04 21:36:09 -08:00
if ( ! test_opt ( sbi , RESERVE_ROOT ) & &
2018-03-08 14:22:56 +08:00
( ! uid_eq ( F2FS_OPTION ( sbi ) . s_resuid ,
2018-01-04 21:36:09 -08:00
make_kuid ( & init_user_ns , F2FS_DEF_RESUID ) ) | |
2018-03-08 14:22:56 +08:00
! gid_eq ( F2FS_OPTION ( sbi ) . s_resgid ,
2018-01-04 21:36:09 -08:00
make_kgid ( & init_user_ns , F2FS_DEF_RESGID ) ) ) )
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Ignore s_resuid=%u, s_resgid=%u w/o reserve_root " ,
from_kuid_munged ( & init_user_ns ,
F2FS_OPTION ( sbi ) . s_resuid ) ,
from_kgid_munged ( & init_user_ns ,
F2FS_OPTION ( sbi ) . s_resgid ) ) ;
2017-12-27 15:05:52 -08:00
}
f2fs: fix to reserve space for IO align feature
https://bugzilla.kernel.org/show_bug.cgi?id=204137
With below script, we will hit panic during new segment allocation:
DISK=bingo.img
MOUNT_DIR=/mnt/f2fs
dd if=/dev/zero of=$DISK bs=1M count=105
mkfs.f2fs -a 1 -o 19 -t 1 -z 1 -f -q $DISK
mount -t f2fs $DISK $MOUNT_DIR -o "noinline_dentry,flush_merge,noextent_cache,mode=lfs,io_bits=7,fsync_mode=strict"
for (( i = 0; i < 4096; i++ )); do
name=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 10`
mkdir $MOUNT_DIR/$name
done
umount $MOUNT_DIR
rm $DISK
--- Core dump ---
Call Trace:
allocate_segment_by_default+0x9d/0x100 [f2fs]
f2fs_allocate_data_block+0x3c0/0x5c0 [f2fs]
do_write_page+0x62/0x110 [f2fs]
f2fs_outplace_write_data+0x43/0xc0 [f2fs]
f2fs_do_write_data_page+0x386/0x560 [f2fs]
__write_data_page+0x706/0x850 [f2fs]
f2fs_write_cache_pages+0x267/0x6a0 [f2fs]
f2fs_write_data_pages+0x19c/0x2e0 [f2fs]
do_writepages+0x1c/0x70
__filemap_fdatawrite_range+0xaa/0xe0
filemap_fdatawrite+0x1f/0x30
f2fs_sync_dirty_inodes+0x74/0x1f0 [f2fs]
block_operations+0xdc/0x350 [f2fs]
f2fs_write_checkpoint+0x104/0x1150 [f2fs]
f2fs_sync_fs+0xa2/0x120 [f2fs]
f2fs_balance_fs_bg+0x33c/0x390 [f2fs]
f2fs_write_node_pages+0x4c/0x1f0 [f2fs]
do_writepages+0x1c/0x70
__writeback_single_inode+0x45/0x320
writeback_sb_inodes+0x273/0x5c0
wb_writeback+0xff/0x2e0
wb_workfn+0xa1/0x370
process_one_work+0x138/0x350
worker_thread+0x4d/0x3d0
kthread+0x109/0x140
ret_from_fork+0x25/0x30
The root cause here is that, with the IO alignment feature enabled, in the
worst case we need F2FS_IO_SIZE() blocks of free space for a single 4k
write, because the IO alignment feature fills in dummy pages to make the IO
aligned.
So we can easily run out of free segments during writeback of a non-inline
directory's data, even in the process of foreground GC.
To fix this issue, reserve additional free space for the IO alignment
feature to handle the worst case of free space usage ratio during FGGC.
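The size of the extra reservation can be illustrated with plain integers. The sketch below mirrors the integer arithmetic of adjust_reserved_segment() but takes its inputs as parameters instead of reading them from struct f2fs_sb_info. The function name extra_reserved_segments() is hypothetical, and the second assertion is a guard added for this sketch only.

```c
#include <assert.h>

/*
 * Extra reserved segments wanted when IO alignment is enabled.
 * io_size is the IO unit in blocks, reserved is the number of
 * segments already reserved.
 */
static unsigned int extra_reserved_segments(unsigned int blocks_per_seg,
					    unsigned int segs_per_sec,
					    unsigned int io_size,
					    unsigned int reserved)
{
	unsigned int sec_blks = blocks_per_seg * segs_per_sec;
	unsigned int avg_vblocks;

	assert(io_size != 0 && io_size <= sec_blks);
	/* average valid block count in a section in the worst case */
	avg_vblocks = sec_blks / io_size;
	/* sketch-only guard: avoid unsigned wrap in the subtraction */
	assert(io_size >= avg_vblocks);

	return (io_size / avg_vblocks) * reserved - reserved;
}
```

Assuming io_bits=7 from the reproducer corresponds to an IO unit of 128 blocks, with 512 blocks per segment, one segment per section and 10 segments already reserved, this yields 128 / 4 = 32 times the base reservation, i.e. 310 additional segments.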
Fixes: 0a595ebaaa6b ("f2fs: support IO alignment for DATA and NODE writes")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-11 21:27:36 +08:00
static inline int adjust_reserved_segment ( struct f2fs_sb_info * sbi )
{
unsigned int sec_blks = sbi - > blocks_per_seg * sbi - > segs_per_sec ;
unsigned int avg_vblocks ;
unsigned int wanted_reserved_segments ;
block_t avail_user_block_count ;
if ( ! F2FS_IO_ALIGNED ( sbi ) )
return 0 ;
/* average valid block count in section in worst case */
avg_vblocks = sec_blks / F2FS_IO_SIZE ( sbi ) ;
/*
* we need enough free space when migrating one section in worst case
*/
wanted_reserved_segments = ( F2FS_IO_SIZE ( sbi ) / avg_vblocks ) *
reserved_segments ( sbi ) ;
wanted_reserved_segments - = reserved_segments ( sbi ) ;
avail_user_block_count = sbi - > user_block_count -
sbi - > current_reserved_blocks -
F2FS_OPTION ( sbi ) . root_reserved_blocks ;
if ( wanted_reserved_segments * sbi - > blocks_per_seg >
avail_user_block_count ) {
f2fs_err ( sbi , " IO align feature can't grab additional reserved segment: %u, available segments: %u " ,
wanted_reserved_segments ,
avail_user_block_count > > sbi - > log_blocks_per_seg ) ;
return - ENOSPC ;
}
SM_I ( sbi ) - > additional_reserved_segments = wanted_reserved_segments ;
f2fs_info ( sbi , " IO align feature needs additional reserved segment: %u " ,
wanted_reserved_segments ) ;
return 0 ;
}
2020-05-15 17:20:50 -07:00
static inline void adjust_unusable_cap_perc ( struct f2fs_sb_info * sbi )
{
if ( ! F2FS_OPTION ( sbi ) . unusable_cap_perc )
return ;
if ( F2FS_OPTION ( sbi ) . unusable_cap_perc = = 100 )
F2FS_OPTION ( sbi ) . unusable_cap = sbi - > user_block_count ;
else
F2FS_OPTION ( sbi ) . unusable_cap = ( sbi - > user_block_count / 100 ) *
F2FS_OPTION ( sbi ) . unusable_cap_perc ;
f2fs_info ( sbi , " Adjust unusable cap for checkpoint=disable = %u / %u%% " ,
F2FS_OPTION ( sbi ) . unusable_cap ,
F2FS_OPTION ( sbi ) . unusable_cap_perc ) ;
}
2012-11-02 17:07:47 +09:00
static void init_once ( void * foo )
{
struct f2fs_inode_info * fi = ( struct f2fs_inode_info * ) foo ;
inode_init_once ( & fi - > vfs_inode ) ;
}
2017-08-08 10:54:31 +08:00
# ifdef CONFIG_QUOTA
static const char * const quotatypes [ ] = INITQFNAMES ;
# define QTYPE2NAME(t) (quotatypes[t])
static int f2fs_set_qf_name ( struct super_block * sb , int qtype ,
substring_t * args )
{
struct f2fs_sb_info * sbi = F2FS_SB ( sb ) ;
char * qname ;
int ret = - EINVAL ;
2018-03-08 14:22:56 +08:00
if ( sb_any_quota_loaded ( sb ) & & ! F2FS_OPTION ( sbi ) . s_qf_names [ qtype ] ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " Cannot change journaled quota options when quota turned on " ) ;
2017-08-08 10:54:31 +08:00
return - EINVAL ;
}
2018-10-24 18:34:26 +08:00
if ( f2fs_sb_has_quota_ino ( sbi ) ) {
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " QUOTA feature is enabled, so ignore qf_name " ) ;
2017-10-06 09:14:28 -07:00
return 0 ;
}
2017-08-08 10:54:31 +08:00
qname = match_strdup ( args ) ;
if ( ! qname ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " Not enough memory for storing quotafile name " ) ;
2019-01-01 21:33:11 +08:00
return - ENOMEM ;
2017-08-08 10:54:31 +08:00
}
2018-03-08 14:22:56 +08:00
if ( F2FS_OPTION ( sbi ) . s_qf_names [ qtype ] ) {
if ( strcmp ( F2FS_OPTION ( sbi ) . s_qf_names [ qtype ] , qname ) = = 0 )
2017-08-08 10:54:31 +08:00
ret = 0 ;
else
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " %s quota file already specified " ,
2017-08-08 10:54:31 +08:00
QTYPE2NAME ( qtype ) ) ;
goto errout ;
}
if ( strchr ( qname , ' / ' ) ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " quotafile must be on filesystem root " ) ;
2017-08-08 10:54:31 +08:00
goto errout ;
}
2018-03-08 14:22:56 +08:00
F2FS_OPTION ( sbi ) . s_qf_names [ qtype ] = qname ;
2017-08-08 10:54:31 +08:00
set_opt ( sbi , QUOTA ) ;
return 0 ;
errout :
2020-06-17 20:30:12 +08:00
kfree ( qname ) ;
2017-08-08 10:54:31 +08:00
return ret ;
}
static int f2fs_clear_qf_name ( struct super_block * sb , int qtype )
{
struct f2fs_sb_info * sbi = F2FS_SB ( sb ) ;
2018-03-08 14:22:56 +08:00
if ( sb_any_quota_loaded ( sb ) & & F2FS_OPTION ( sbi ) . s_qf_names [ qtype ] ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " Cannot change journaled quota options when quota turned on " ) ;
2017-08-08 10:54:31 +08:00
return - EINVAL ;
}
2020-06-17 20:30:12 +08:00
kfree ( F2FS_OPTION ( sbi ) . s_qf_names [ qtype ] ) ;
2018-03-08 14:22:56 +08:00
F2FS_OPTION ( sbi ) . s_qf_names [ qtype ] = NULL ;
2017-08-08 10:54:31 +08:00
return 0 ;
}
static int f2fs_check_quota_options ( struct f2fs_sb_info * sbi )
{
/*
* We do the test below only for project quotas . ' usrquota ' and
* ' grpquota ' mount options are allowed even without quota feature
* to support legacy quotas in quota files .
*/
2018-10-24 18:34:26 +08:00
if ( test_opt ( sbi , PRJQUOTA ) & & ! f2fs_sb_has_project_quota ( sbi ) ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " Project quota feature not enabled. Cannot enable project quota enforcement. " ) ;
2017-08-08 10:54:31 +08:00
return - 1 ;
}
2018-03-08 14:22:56 +08:00
if ( F2FS_OPTION ( sbi ) . s_qf_names [ USRQUOTA ] | |
F2FS_OPTION ( sbi ) . s_qf_names [ GRPQUOTA ] | |
F2FS_OPTION ( sbi ) . s_qf_names [ PRJQUOTA ] ) {
if ( test_opt ( sbi , USRQUOTA ) & &
F2FS_OPTION ( sbi ) . s_qf_names [ USRQUOTA ] )
2017-08-08 10:54:31 +08:00
clear_opt ( sbi , USRQUOTA ) ;
2018-03-08 14:22:56 +08:00
if ( test_opt ( sbi , GRPQUOTA ) & &
F2FS_OPTION ( sbi ) . s_qf_names [ GRPQUOTA ] )
2017-08-08 10:54:31 +08:00
clear_opt ( sbi , GRPQUOTA ) ;
2018-03-08 14:22:56 +08:00
if ( test_opt ( sbi , PRJQUOTA ) & &
F2FS_OPTION ( sbi ) . s_qf_names [ PRJQUOTA ] )
2017-08-08 10:54:31 +08:00
clear_opt ( sbi , PRJQUOTA ) ;
if ( test_opt ( sbi , GRPQUOTA ) | | test_opt ( sbi , USRQUOTA ) | |
test_opt ( sbi , PRJQUOTA ) ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " old and new quota format mixing " ) ;
2017-08-08 10:54:31 +08:00
return - 1 ;
}
2018-03-08 14:22:56 +08:00
if ( ! F2FS_OPTION ( sbi ) . s_jquota_fmt ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " journaled quota format not specified " ) ;
2017-08-08 10:54:31 +08:00
return - 1 ;
}
}
2017-10-06 09:14:28 -07:00
2018-10-24 18:34:26 +08:00
if ( f2fs_sb_has_quota_ino ( sbi ) & & F2FS_OPTION ( sbi ) . s_jquota_fmt ) {
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " QUOTA feature is enabled, so ignore jquota_fmt " ) ;
2018-03-08 14:22:56 +08:00
F2FS_OPTION ( sbi ) . s_jquota_fmt = 0 ;
2017-10-06 09:14:28 -07:00
}
2017-08-08 10:54:31 +08:00
return 0 ;
}
# endif
fscrypt: support test_dummy_encryption=v2
v1 encryption policies are deprecated in favor of v2, and some new
features (e.g. encryption+casefolding) are only being added for v2.
Therefore, the "test_dummy_encryption" mount option (which is used for
encryption I/O testing with xfstests) needs to support v2 policies.
To do this, extend its syntax to be "test_dummy_encryption=v1" or
"test_dummy_encryption=v2". The existing "test_dummy_encryption" (no
argument) also continues to be accepted, to specify the default setting
-- currently v1, but the next patch changes it to v2.
To cleanly support both v1 and v2 while also making it easy to support
specifying other encryption settings in the future (say, accepting
"$contents_mode:$filenames_mode:v2"), make ext4 and f2fs maintain a
pointer to the dummy fscrypt_context rather than using mount flags.
To avoid concurrency issues, don't allow test_dummy_encryption to be set
or changed during a remount. (The former restriction is new, but
xfstests doesn't run into it, so no one should notice.)
Tested with 'gce-xfstests -c {ext4,f2fs}/encrypt -g auto'. On ext4,
there are two regressions, both of which are test bugs: ext4/023 and
ext4/028 fail because they set an xattr and expect it to be stored
inline, but the increase in size of the fscrypt_context from
24 to 40 bytes causes this xattr to be spilled into an external block.
Link: https://lore.kernel.org/r/20200512233251.118314-4-ebiggers@kernel.org
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-12 16:32:50 -07:00
static int f2fs_set_test_dummy_encryption ( struct super_block * sb ,
const char * opt ,
const substring_t * arg ,
bool is_remount )
{
struct f2fs_sb_info * sbi = F2FS_SB ( sb ) ;
2022-06-06 15:32:41 -07:00
struct fs_parameter param = {
. type = fs_value_is_string ,
. string = arg - > from ? arg - > from : " " ,
} ;
struct fscrypt_dummy_policy * policy =
& F2FS_OPTION ( sbi ) . dummy_enc_policy ;
2020-05-12 16:32:50 -07:00
int err ;
2022-06-06 15:32:41 -07:00
if ( ! IS_ENABLED ( CONFIG_FS_ENCRYPTION ) ) {
f2fs_warn ( sbi , " test_dummy_encryption option not supported " ) ;
return - EINVAL ;
}
2020-05-12 16:32:50 -07:00
if ( ! f2fs_sb_has_encrypt ( sbi ) ) {
f2fs_err ( sbi , " Encrypt feature is off " ) ;
return - EINVAL ;
}
/*
* This mount option is just for testing , and it ' s not worthwhile to
* implement the extra complexity ( e . g . RCU protection ) that would be
* needed to allow it to be set or changed during remount . We do allow
* it to be specified during remount , but only if there is no change .
*/
2022-06-06 15:32:41 -07:00
if ( is_remount & & ! fscrypt_is_dummy_policy_set ( policy ) ) {
2020-05-12 16:32:50 -07:00
f2fs_warn ( sbi , " Can't set test_dummy_encryption on remount " ) ;
return - EINVAL ;
}
2022-06-06 15:32:41 -07:00
err = fscrypt_parse_test_dummy_encryption ( & param , policy ) ;
2020-05-12 16:32:50 -07:00
if ( err ) {
if ( err = = - EEXIST )
f2fs_warn ( sbi ,
" Can't change test_dummy_encryption on remount " ) ;
else if ( err = = - EINVAL )
f2fs_warn ( sbi , " Value of option \" %s \" is unrecognized " ,
opt ) ;
else
f2fs_warn ( sbi , " Error processing option \" %s \" [%d] " ,
opt , err ) ;
return - EINVAL ;
}
f2fs_warn ( sbi , " Test dummy encryption mode enabled " ) ;
2022-04-30 22:08:52 -07:00
return 0 ;
2020-05-12 16:32:50 -07:00
}
2021-01-22 17:46:43 +08:00
# ifdef CONFIG_F2FS_FS_COMPRESSION
2021-06-08 19:15:08 +08:00
/*
 * 1. The same extension name cannot appear in both the compress and the
 *    non-compress extension lists at the same time.
 * 2. If the compress extension specifies all files, the types specified by
 *    the non-compress extension are treated as special cases and are not
 *    compressed.
 * 3. The non-compress extension must not specify all files.
 */
static int f2fs_test_compress_extension(struct f2fs_sb_info *sbi)
{
	unsigned char (*ext)[F2FS_EXTENSION_LEN];
	unsigned char (*noext)[F2FS_EXTENSION_LEN];
	int ext_cnt, noext_cnt, index = 0, no_index = 0;

	ext = F2FS_OPTION(sbi).extensions;
	ext_cnt = F2FS_OPTION(sbi).compress_ext_cnt;
	noext = F2FS_OPTION(sbi).noextensions;
	noext_cnt = F2FS_OPTION(sbi).nocompress_ext_cnt;

	/* Nothing to cross-check when no nocompress extension is set. */
	if (!noext_cnt)
		return 0;

	for (no_index = 0; no_index < noext_cnt; no_index++) {
		if (!strcasecmp("*", noext[no_index])) {
			f2fs_info(sbi, "Don't allow the nocompress extension specifies all files");
			return -EINVAL;
		}
		for (index = 0; index < ext_cnt; index++) {
			if (!strcasecmp(ext[index], noext[no_index])) {
				f2fs_info(sbi, "Don't allow the same extension %s appear in both compress and nocompress extension",
					  ext[index]);
				return -EINVAL;
			}
		}
	}
	return 0;
}
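The three rules in the comment above f2fs_test_compress_extension() can be tried outside the kernel. The following is a minimal user-space sketch, not the kernel code: the function name check_extension_lists() and the EXT_LEN constant are hypothetical stand-ins, and the lists are passed as plain arguments instead of being read from F2FS_OPTION(sbi).

```c
#include <errno.h>
#include <strings.h>

#define EXT_LEN 8	/* hypothetical stand-in for F2FS_EXTENSION_LEN */

/*
 * User-space sketch of the cross-check: reject "*" in the nocompress
 * list and reject any extension that appears in both lists, comparing
 * case-insensitively. Returns 0 when the lists are consistent and
 * -EINVAL otherwise.
 */
int check_extension_lists(const char (*ext)[EXT_LEN], int ext_cnt,
			  const char (*noext)[EXT_LEN], int noext_cnt)
{
	int i, j;

	for (j = 0; j < noext_cnt; j++) {
		/* Rule 3: the nocompress list must not cover all files. */
		if (!strcasecmp("*", noext[j]))
			return -EINVAL;

		/* Rule 1: no extension may be in both lists. */
		for (i = 0; i < ext_cnt; i++)
			if (!strcasecmp(ext[i], noext[j]))
				return -EINVAL;
	}
	return 0;
}
```

Rule 2 needs no check of its own: a "*" entry in the compress list is an ordinary string here, so it only conflicts with a "*" in the nocompress list, which rule 3 already rejects.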
2021-01-22 17:46:43 +08:00
#ifdef CONFIG_F2FS_FS_LZ4
static int f2fs_set_lz4hc_level(struct f2fs_sb_info *sbi, const char *str)
{
#ifdef CONFIG_F2FS_FS_LZ4HC
	unsigned int level;
#endif

	/* A bare "lz4" selects the default compression level. */
	if (strlen(str) == 3) {
		F2FS_OPTION(sbi).compress_level = 0;
		return 0;
	}

#ifdef CONFIG_F2FS_FS_LZ4HC
	/* Skip the "lz4" prefix; a ":<level>" suffix must follow. */
	str += 3;

	if (str[0] != ':') {
		f2fs_info(sbi, "wrong format, e.g. <alg_name>:<compr_level>");
		return -EINVAL;
	}
	if (kstrtouint(str + 1, 10, &level))
		return -EINVAL;

	if (level < LZ4HC_MIN_CLEVEL || level > LZ4HC_MAX_CLEVEL) {
		f2fs_info(sbi, "invalid lz4hc compress level: %u", level);
		return -EINVAL;
	}

	F2FS_OPTION(sbi).compress_level = level;
	return 0;
#else
	f2fs_info(sbi, "kernel doesn't support lz4hc compression");
	return -EINVAL;
#endif
}
#endif
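The "<alg_name>:<compr_level>" parsing used by f2fs_set_lz4hc_level() can be exercised in user space. This is a rough sketch under stated assumptions: parse_compress_level() is a hypothetical helper, strtoul() stands in for the kernel's kstrtouint() (and is slightly stricter here, since a leading sign is rejected), and the level bounds are supplied by the caller instead of coming from kernel constants.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse "<alg>" or "<alg>:<level>", where the algorithm name is
 * alg_len characters long. A bare name yields level 0 (the default);
 * otherwise the level must be a decimal number within [min, max].
 * Returns 0 on success and -EINVAL on any malformed input.
 */
int parse_compress_level(const char *str, size_t alg_len,
			 unsigned int min, unsigned int max,
			 unsigned int *level)
{
	char *end;
	unsigned long val;

	/* Extra guard for the sketch: never step past the string end. */
	if (strlen(str) < alg_len)
		return -EINVAL;

	if (strlen(str) == alg_len) {
		*level = 0;
		return 0;
	}

	str += alg_len;
	if (str[0] != ':')
		return -EINVAL;
	str++;

	/* Reject an empty level, a sign, or leading whitespace. */
	if (*str < '0' || *str > '9')
		return -EINVAL;

	errno = 0;
	val = strtoul(str, &end, 10);
	if (errno || *end != '\0')
		return -EINVAL;

	if (val < min || val > max)
		return -EINVAL;

	*level = (unsigned int)val;
	return 0;
}
```

Passing the prefix length and the bounds as parameters lets one helper cover more than one algorithm name; the kernel code instead keeps a separate function per algorithm.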
#ifdef CONFIG_F2FS_FS_ZSTD
static int f2fs_set_zstd_level(struct f2fs_sb_info *sbi, const char *str)
{
	unsigned int level;
	int len = 4;

	/* A bare "zstd" selects the default compression level. */
	if (strlen(str) == len) {
		F2FS_OPTION(sbi).compress_level = 0;
		return 0;
	}

	/* Skip the "zstd" prefix; a ":<level>" suffix must follow. */
	str += len;

	if (str[0] != ':') {
		f2fs_info(sbi, "wrong format, e.g. <alg_name>:<compr_level>");
		return -EINVAL;
	}
	if (kstrtouint(str + 1, 10, &level))
		return -EINVAL;
2020-09-11 16:49:00 -07:00
	if (!level || level > zstd_max_clevel()) {
2021-01-22 17:46:43 +08:00
		f2fs_info(sbi, "invalid zstd compress level: %u", level);
		return -EINVAL;
	}

	F2FS_OPTION(sbi).compress_level = level;
	return 0;
}
#endif
#endif
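The comma-separated option loop in parse_options() follows a common pattern: split off one token at a time, skip empty tokens, then match either a bare flag or a "name=value" pair. Below is a simplified user-space sketch of that pattern, not the kernel implementation. The names next_token(), parse_opts(), struct demo_opts and the DEMO_* enumerators are hypothetical, plain strcmp() replaces the kernel's match_token() table, and rejecting unknown options is an assumption because the default case is not part of this excerpt.

```c
#include <errno.h>
#include <string.h>

enum demo_bggc_mode { DEMO_BGGC_ON, DEMO_BGGC_OFF, DEMO_BGGC_SYNC };

struct demo_opts {
	enum demo_bggc_mode bggc;
	int discard;
};

/* Split off the next comma-separated token, in the spirit of strsep(). */
static char *next_token(char **s)
{
	char *tok = *s, *comma;

	if (!tok)
		return NULL;

	comma = strchr(tok, ',');
	if (comma) {
		*comma = '\0';
		*s = comma + 1;
	} else {
		*s = NULL;
	}
	return tok;
}

/* Parse a writable option string; returns 0 or -EINVAL. */
int parse_opts(char *options, struct demo_opts *o)
{
	static const char key[] = "background_gc=";
	char *p;

	while ((p = next_token(&options)) != NULL) {
		const char *val;

		if (!*p)
			continue;	/* tolerate empty tokens such as ",," */

		if (!strcmp(p, "discard")) {
			o->discard = 1;
		} else if (!strcmp(p, "nodiscard")) {
			o->discard = 0;
		} else if (!strncmp(p, key, sizeof(key) - 1)) {
			val = p + sizeof(key) - 1;
			if (!strcmp(val, "on"))
				o->bggc = DEMO_BGGC_ON;
			else if (!strcmp(val, "off"))
				o->bggc = DEMO_BGGC_OFF;
			else if (!strcmp(val, "sync"))
				o->bggc = DEMO_BGGC_SYNC;
			else
				return -EINVAL;
		} else {
			return -EINVAL;	/* assumed: unknown options fail */
		}
	}
	return 0;
}
```

A NULL option string leaves the defaults untouched and returns 0, which mirrors the early exit taken by parse_options() when no options are given.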
2020-05-12 16:32:50 -07:00
static int parse_options ( struct super_block * sb , char * options , bool is_remount )
2013-06-16 09:48:48 +09:00
{
struct f2fs_sb_info * sbi = F2FS_SB ( sb ) ;
substring_t args [ MAX_OPT_ARGS ] ;
2020-07-29 21:21:36 +08:00
# ifdef CONFIG_F2FS_FS_COMPRESSION
f2fs: support data compression
This patch tries to support compression in f2fs.
- New term named cluster is defined as basic unit of compression, file can
be divided into multiple clusters logically. One cluster includes 4 << n
(n >= 0) logical pages, compression size is also cluster size, each of
cluster can be compressed or not.
- In cluster metadata layout, one special flag is used to indicate cluster
is compressed one or normal one, for compressed cluster, following metadata
maps cluster to [1, 4 << n - 1] physical blocks, in where f2fs stores
data including compress header and compressed data.
- In order to eliminate write amplification during overwrite, F2FS only
support compression on write-once file, data can be compressed only when
all logical blocks in file are valid and cluster compress ratio is lower
than specified threshold.
- To enable compression on regular inode, there are three ways:
* chattr +c file
* chattr +c dir; touch dir/file
* mount w/ -o compress_extension=ext; touch file.ext
Compress metadata layout:
[Dnode Structure]
+-----------------------------------------------+
| cluster 1 | cluster 2 | ......... | cluster N |
+-----------------------------------------------+
. . . .
. . . .
. Compressed Cluster . . Normal Cluster .
+----------+---------+---------+---------+ +---------+---------+---------+---------+
|compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 |
+----------+---------+---------+---------+ +---------+---------+---------+---------+
. .
. .
. .
+-------------+-------------+----------+----------------------------+
| data length | data chksum | reserved | compressed data |
+-------------+-------------+----------+----------------------------+
Changelog:
20190326:
- fix error handling of read_end_io().
- remove unneeded comments in f2fs_encrypt_one_page().
20190327:
- fix wrong use of f2fs_cluster_is_full() in f2fs_mpage_readpages().
- don't jump into loop directly to avoid uninitialized variables.
- add TODO tag in error path of f2fs_write_cache_pages().
20190328:
- fix wrong merge condition in f2fs_read_multi_pages().
- check compressed file in f2fs_post_read_required().
20190401
- allow overwrite on non-compressed cluster.
- check cluster meta before writing compressed data.
20190402
- don't preallocate blocks for compressed file.
- add lz4 compress algorithm
- process multiple post read works in one workqueue
Now f2fs supports processing post read work in multiple workqueue,
it shows low performance due to schedule overhead of multiple
workqueue executing orderly.
20190921
- compress: support buffered overwrite
C: compress cluster flag
V: valid block address
N: NEW_ADDR
One cluster contain 4 blocks
before overwrite after overwrite
- VVVV -> CVNN
- CVNN -> VVVV
- CVNN -> CVNN
- CVNN -> CVVV
- CVVV -> CVNN
- CVVV -> CVVV
20191029
- add kconfig F2FS_FS_COMPRESSION to isolate compression related
codes, add kconfig F2FS_FS_{LZO,LZ4} to cover backend algorithm.
note that: will remove lzo backend if Jaegeuk agreed that too.
- update codes according to Eric's comments.
20191101
- apply fixes from Jaegeuk
20191113
- apply fixes from Jaegeuk
- split workqueue for fsverity
20191216
- apply fixes from Jaegeuk
20200117
- fix to avoid NULL pointer dereference
[Jaegeuk Kim]
- add tracepoint for f2fs_{,de}compress_pages()
- fix many bugs and add some compression stats
- fix overwrite/mmap bugs
- address 32bit build error, reported by Geert.
- bug fixes when handling errors and i_compressed_blocks
Reported-by: <noreply@ellerman.id.au>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-11-01 18:07:14 +08:00
unsigned char ( * ext ) [ F2FS_EXTENSION_LEN ] ;
2021-06-08 19:15:08 +08:00
unsigned char ( * noext ) [ F2FS_EXTENSION_LEN ] ;
int ext_cnt , noext_cnt ;
2020-07-29 21:21:36 +08:00
# endif
2013-06-16 09:48:48 +09:00
char * p , * name ;
2020-07-29 21:21:36 +08:00
int arg = 0 ;
2018-01-04 21:36:09 -08:00
kuid_t uid ;
kgid_t gid ;
2017-08-08 10:54:31 +08:00
int ret ;
2013-06-16 09:48:48 +09:00
if ( ! options )
2021-05-21 01:32:53 -07:00
goto default_check ;
2013-06-16 09:48:48 +09:00
while ( ( p = strsep ( & options , " , " ) ) ! = NULL ) {
int token ;
2021-04-06 09:47:35 +08:00
2013-06-16 09:48:48 +09:00
if ( ! * p )
continue ;
/*
* Initialize args struct so we know whether arg was
* found ; some options take optional arguments .
*/
args [ 0 ] . to = args [ 0 ] . from = NULL ;
token = match_token ( p , f2fs_tokens , args ) ;
switch ( token ) {
case Opt_gc_background :
name = match_strdup ( & args [ 0 ] ) ;
if ( ! name )
return - ENOMEM ;
2020-05-01 16:35:23 -07:00
if ( ! strcmp ( name , " on " ) ) {
2020-02-14 17:44:13 +08:00
F2FS_OPTION ( sbi ) . bggc_mode = BGGC_MODE_ON ;
2020-05-01 16:35:23 -07:00
} else if ( ! strcmp ( name , " off " ) ) {
2020-02-14 17:44:13 +08:00
F2FS_OPTION ( sbi ) . bggc_mode = BGGC_MODE_OFF ;
2020-05-01 16:35:23 -07:00
} else if ( ! strcmp ( name , " sync " ) ) {
2020-02-14 17:44:13 +08:00
F2FS_OPTION ( sbi ) . bggc_mode = BGGC_MODE_SYNC ;
2015-10-05 11:02:54 -07:00
} else {
2020-06-17 20:30:12 +08:00
kfree ( name ) ;
2013-06-16 09:48:48 +09:00
return - EINVAL ;
}
2020-06-17 20:30:12 +08:00
kfree ( name ) ;
2013-06-16 09:48:48 +09:00
break ;
case Opt_disable_roll_forward :
set_opt ( sbi , DISABLE_ROLL_FORWARD ) ;
break ;
2015-01-23 18:33:46 -08:00
case Opt_norecovery :
/* this option mounts f2fs with ro */
2020-02-14 17:45:11 +08:00
set_opt ( sbi , NORECOVERY ) ;
2015-01-23 18:33:46 -08:00
if ( ! f2fs_readonly ( sb ) )
return - EINVAL ;
break ;
2013-06-16 09:48:48 +09:00
case Opt_discard :
2021-08-30 08:35:33 +08:00
if ( ! f2fs_hw_support_discard ( sbi ) ) {
f2fs_warn ( sbi , " device does not support discard " ) ;
break ;
}
f2fs: fix to avoid NULL pointer dereference on se->discard_map
https://bugzilla.kernel.org/show_bug.cgi?id=200951
There is a NULL pointer dereference issue reported in bugzilla:
Hi,
in the setup there is a SATA SSD connected to a SATA-to-USB bridge.
The disc is "Samsung SSD 850 PRO 256G" which supports TRIM.
There are four partitions:
sda1: FAT /boot
sda2: F2FS /
sda3: F2FS /home
sda4: F2FS
The bridge is ASMT1153e which uses the "uas" driver.
There is no TRIM pass-through, so, when mounting it reports:
mounting with "discard" option, but the device does not support discard
The USB host is USB3.0 and UASP capable. It is the one on RK3399.
Given this everything works fine, except there is no TRIM support.
In order to enable TRIM a new UDEV rule is added [1]:
/etc/udev/rules.d/10-sata-bridge-trim.rules:
ACTION=="add|change", ATTRS{idVendor}=="174c", ATTRS{idProduct}=="55aa", SUBSYSTEM=="scsi_disk", ATTR{provisioning_mode}="unmap"
After reboot any F2FS write hangs forever and dmesg reports:
Unable to handle kernel NULL pointer dereference
Also tested on a x86_64 system: works fine even with TRIM enabled.
same disc
same bridge
different usb host controller
different cpu architecture
not root filesystem
Regards,
Vicenç.
[1] Post #5 in https://bbs.archlinux.org/viewtopic.php?id=236280
Unable to handle kernel NULL pointer dereference at virtual address 000000000000003e
Mem abort info:
ESR = 0x96000004
Exception class = DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000004
CM = 0, WnR = 0
user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000626e3122
[000000000000003e] pgd=0000000000000000
Internal error: Oops: 96000004 [#1] SMP
Modules linked in: overlay snd_soc_hdmi_codec rc_cec dw_hdmi_i2s_audio dw_hdmi_cec snd_soc_simple_card snd_soc_simple_card_utils snd_soc_rockchip_i2s rockchip_rga snd_soc_rockchip_pcm rockchipdrm videobuf2_dma_sg v4l2_mem2mem rtc_rk808 videobuf2_memops analogix_dp videobuf2_v4l2 videobuf2_common dw_hdmi dw_wdt cec rc_core videodev drm_kms_helper media drm rockchip_thermal rockchip_saradc realtek drm_panel_orientation_quirks syscopyarea sysfillrect sysimgblt fb_sys_fops dwmac_rk stmmac_platform stmmac pwm_bl squashfs loop crypto_user gpio_keys hid_kensington
CPU: 5 PID: 957 Comm: nvim Not tainted 4.19.0-rc1-1-ARCH #1
Hardware name: Sapphire-RK3399 Board (DT)
pstate: 00000005 (nzcv daif -PAN -UAO)
pc : update_sit_entry+0x304/0x4b0
lr : update_sit_entry+0x108/0x4b0
sp : ffff00000ca13bd0
x29: ffff00000ca13bd0 x28: 000000000000003e
x27: 0000000000000020 x26: 0000000000080000
x25: 0000000000000048 x24: ffff8000ebb85cf8
x23: 0000000000000253 x22: 00000000ffffffff
x21: 00000000000535f2 x20: 00000000ffffffdf
x19: ffff8000eb9e6800 x18: ffff8000eb9e6be8
x17: 0000000007ce6926 x16: 000000001c83ffa8
x15: 0000000000000000 x14: ffff8000f602df90
x13: 0000000000000006 x12: 0000000000000040
x11: 0000000000000228 x10: 0000000000000000
x9 : 0000000000000000 x8 : 0000000000000000
x7 : 00000000000535f2 x6 : ffff8000ebff3440
x5 : ffff8000ebff3440 x4 : ffff8000ebe3a6c8
x3 : 00000000ffffffff x2 : 0000000000000020
x1 : 0000000000000000 x0 : ffff8000eb9e5800
Process nvim (pid: 957, stack limit = 0x0000000063a78320)
Call trace:
update_sit_entry+0x304/0x4b0
f2fs_invalidate_blocks+0x98/0x140
truncate_node+0x90/0x400
f2fs_remove_inode_page+0xe8/0x340
f2fs_evict_inode+0x2b0/0x408
evict+0xe0/0x1e0
iput+0x160/0x260
do_unlinkat+0x214/0x298
__arm64_sys_unlinkat+0x3c/0x68
el0_svc_handler+0x94/0x118
el0_svc+0x8/0xc
Code: f9400800 b9488400 36080140 f9400f01 (387c4820)
---[ end trace a0f21a307118c477 ]---
The reason is it is possible to enable discard flag on block queue via
UDEV, but during mount, f2fs will initialize se->discard_map only if
this flag is set, once the flag is set after mount, f2fs may dereference
NULL pointer on se->discard_map.
So this patch does below changes to fix this issue:
- initialize and update se->discard_map all the time.
- don't clear DISCARD option if device has no QUEUE_FLAG_DISCARD flag
during mount.
- don't issue small discard on zoned block device.
- introduce some functions to enhance the readability.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Tested-by: Vicente Bergas <vicencb@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-04 03:52:17 +08:00
set_opt ( sbi , DISCARD ) ;
2013-06-16 09:48:48 +09:00
break ;
2016-07-03 22:05:14 +08:00
case Opt_nodiscard :
2021-08-30 08:35:33 +08:00
if ( f2fs_hw_should_discard ( sbi ) ) {
2019-06-18 17:48:42 +08:00
f2fs_warn ( sbi , " discard is required for zoned block devices " ) ;
2016-10-28 17:45:03 +09:00
return - EINVAL ;
}
2016-07-03 22:05:14 +08:00
clear_opt ( sbi , DISCARD ) ;
2016-10-28 17:44:59 +09:00
break ;
2013-06-16 09:48:48 +09:00
case Opt_noheap :
set_opt ( sbi , NOHEAP ) ;
break ;
2017-03-24 20:41:45 -04:00
case Opt_heap :
clear_opt ( sbi , NOHEAP ) ;
break ;
2013-06-16 09:48:48 +09:00
# ifdef CONFIG_F2FS_FS_XATTR
2013-10-07 11:36:20 +09:00
case Opt_user_xattr :
set_opt ( sbi , XATTR_USER ) ;
break ;
2013-06-16 09:48:48 +09:00
case Opt_nouser_xattr :
clear_opt ( sbi , XATTR_USER ) ;
break ;
2013-08-08 15:16:22 +09:00
case Opt_inline_xattr :
set_opt ( sbi , INLINE_XATTR ) ;
break ;
2017-02-15 10:34:45 +08:00
case Opt_noinline_xattr :
clear_opt ( sbi , INLINE_XATTR ) ;
break ;
f2fs: support flexible inline xattr size
Now, in product, more and more features based on file encryption were
introduced, their demand of xattr space is increasing, however, inline
xattr has fixed-size of 200 bytes, once inline xattr space is full, new
increased xattr data would occupy additional xattr block which may bring
us more space usage and performance regression during persisting.
In order to resolve above issue, it's better to expand inline xattr size
flexibly according to user's requirement.
So this patch introduces new filesystem feature 'flexible inline xattr',
and new mount option 'inline_xattr_size=%u', once mkfs enables the
feature, we can use the option to make f2fs supporting flexible inline
xattr size.
To support this feature, we add extra attribute i_inline_xattr_size in
inode layout, indicating that how many space inline xattr borrows from
block address mapping space in inode layout, by this, we can easily
locate and store flexible-sized inline xattr data in inode.
Inode disk layout:
+----------------------+
| .i_mode |
| ... |
| .i_ext |
+----------------------+
| .i_extra_isize |
| .i_inline_xattr_size |-----------+
| ... | |
+----------------------+ |
| .i_addr | |
| - block address or | |
| - inline data | |
+----------------------+<---+ v
| inline xattr | +---inline xattr range
+----------------------+<---+
| .i_nid |
+----------------------+
| node_footer |
| (nid, ino, offset) |
+----------------------+
Note that we have to consider backward compatibility: the inline_data
space, 200 bytes, was always reserved, as reported by Sheng Yong.
Previous inline data or directory always reserved 200 bytes in inode layout,
even if inline_xattr is disabled. In order to keep inline_dentry's structure
for backward compatibility, we get the space back only from inline_data.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reported-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-06 21:59:50 +08:00
case Opt_inline_xattr_size :
if ( args - > from & & match_int ( args , & arg ) )
return - EINVAL ;
set_opt ( sbi , INLINE_XATTR_SIZE ) ;
2018-03-08 14:22:56 +08:00
F2FS_OPTION ( sbi ) . inline_xattr_size = arg ;
2017-09-06 21:59:50 +08:00
break ;
2013-06-16 09:48:48 +09:00
# else
2013-10-07 11:36:20 +09:00
case Opt_user_xattr :
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " user_xattr options not supported " ) ;
2013-10-07 11:36:20 +09:00
break ;
2013-06-16 09:48:48 +09:00
case Opt_nouser_xattr :
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " nouser_xattr options not supported " ) ;
2013-06-16 09:48:48 +09:00
break ;
2013-08-08 15:16:22 +09:00
case Opt_inline_xattr :
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " inline_xattr options not supported " ) ;
2013-08-08 15:16:22 +09:00
break ;
2017-02-15 10:34:45 +08:00
case Opt_noinline_xattr :
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " noinline_xattr options not supported " ) ;
2017-02-15 10:34:45 +08:00
break ;
2013-06-16 09:48:48 +09:00
# endif
# ifdef CONFIG_F2FS_FS_POSIX_ACL
2013-10-07 11:36:20 +09:00
case Opt_acl :
set_opt ( sbi , POSIX_ACL ) ;
break ;
2013-06-16 09:48:48 +09:00
case Opt_noacl :
clear_opt ( sbi , POSIX_ACL ) ;
break ;
# else
2013-10-07 11:36:20 +09:00
case Opt_acl :
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " acl options not supported " ) ;
2013-10-07 11:36:20 +09:00
break ;
2013-06-16 09:48:48 +09:00
case Opt_noacl :
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " noacl options not supported " ) ;
2013-06-16 09:48:48 +09:00
break ;
# endif
case Opt_active_logs :
if ( args - > from & & match_int ( args , & arg ) )
return - EINVAL ;
f2fs: introduce inmem curseg
Previous implementation of aligned pinfile allocation will:
- allocate new segment on cold data log no matter whether last used
segment is partially used or not, it makes IOs more random;
- force concurrent cold data/GCed IO going into warm data area, it
can make a bad effect on hot/cold data separation;
In this patch, we introduce a new type of log named 'inmem curseg',
the differences from a normal curseg are:
- it reuses an existing segment type (CURSEG_XXX_NODE/DATA);
- it only exists in memory; its segno, blkofs and summary will not be
persisted into the checkpoint area;
With this new feature, we can enhance scalability of log, special
allocators can be created for purposes:
- pure lfs allocator for aligned pinfile allocation or file
defragmentation
- pure ssr allocator for later feature
So that, let's update aligned pinfile allocation to use this new
inmem curseg fwk.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-08-04 21:14:45 +08:00
if ( arg ! = 2 & & arg ! = 4 & &
arg ! = NR_CURSEG_PERSIST_TYPE )
2013-06-16 09:48:48 +09:00
return - EINVAL ;
2018-03-08 14:22:56 +08:00
F2FS_OPTION ( sbi ) . active_logs = arg ;
2013-06-16 09:48:48 +09:00
break ;
case Opt_disable_ext_identify :
set_opt ( sbi , DISABLE_EXT_IDENTIFY ) ;
break ;
2013-11-10 23:13:17 +08:00
case Opt_inline_data :
set_opt ( sbi , INLINE_DATA ) ;
break ;
2014-09-24 18:16:13 +08:00
case Opt_inline_dentry :
set_opt ( sbi , INLINE_DENTRY ) ;
break ;
2016-05-09 19:56:34 +08:00
case Opt_noinline_dentry :
clear_opt ( sbi , INLINE_DENTRY ) ;
break ;
2014-04-02 15:34:36 +09:00
case Opt_flush_merge :
set_opt ( sbi , FLUSH_MERGE ) ;
break ;
2016-05-20 22:39:20 -07:00
case Opt_noflush_merge :
clear_opt ( sbi , FLUSH_MERGE ) ;
break ;
2014-07-23 09:57:31 -07:00
case Opt_nobarrier :
set_opt ( sbi , NOBARRIER ) ;
break ;
2022-10-25 01:54:01 +08:00
case Opt_barrier :
clear_opt ( sbi , NOBARRIER ) ;
break ;
2014-10-30 22:47:03 -07:00
case Opt_fastboot :
set_opt ( sbi , FASTBOOT ) ;
break ;
2015-02-05 17:55:51 +08:00
case Opt_extent_cache :
2022-11-30 09:36:43 -08:00
set_opt ( sbi , READ_EXTENT_CACHE ) ;
2015-02-05 17:55:51 +08:00
break ;
2015-06-25 17:43:04 -07:00
case Opt_noextent_cache :
2022-11-30 09:36:43 -08:00
clear_opt ( sbi , READ_EXTENT_CACHE ) ;
2015-06-25 17:43:04 -07:00
break ;
2015-03-24 10:20:27 +08:00
case Opt_noinline_data :
clear_opt ( sbi , INLINE_DATA ) ;
break ;
2015-12-16 13:12:16 +08:00
case Opt_data_flush :
set_opt ( sbi , DATA_FLUSH ) ;
break ;
2017-12-27 15:05:52 -08:00
case Opt_reserve_root :
if ( args - > from & & match_int ( args , & arg ) )
return - EINVAL ;
if ( test_opt ( sbi , RESERVE_ROOT ) ) {
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Preserve previous reserve_root=%u " ,
F2FS_OPTION ( sbi ) . root_reserved_blocks ) ;
2017-12-27 15:05:52 -08:00
} else {
2018-03-08 14:22:56 +08:00
F2FS_OPTION ( sbi ) . root_reserved_blocks = arg ;
2017-12-27 15:05:52 -08:00
set_opt ( sbi , RESERVE_ROOT ) ;
}
break ;
2018-01-04 21:36:09 -08:00
case Opt_resuid :
if ( args - > from & & match_int ( args , & arg ) )
return - EINVAL ;
uid = make_kuid ( current_user_ns ( ) , arg ) ;
if ( ! uid_valid ( uid ) ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " Invalid uid value %d " , arg ) ;
2018-01-04 21:36:09 -08:00
return - EINVAL ;
}
2018-03-08 14:22:56 +08:00
F2FS_OPTION ( sbi ) . s_resuid = uid ;
2018-01-04 21:36:09 -08:00
break ;
case Opt_resgid :
if ( args - > from & & match_int ( args , & arg ) )
return - EINVAL ;
gid = make_kgid ( current_user_ns ( ) , arg ) ;
if ( ! gid_valid ( gid ) ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " Invalid gid value %d " , arg ) ;
2018-01-04 21:36:09 -08:00
return - EINVAL ;
}
2018-03-08 14:22:56 +08:00
F2FS_OPTION ( sbi ) . s_resgid = gid ;
2018-01-04 21:36:09 -08:00
break ;
2016-06-03 19:29:38 -07:00
case Opt_mode :
name = match_strdup ( & args [ 0 ] ) ;
if ( ! name )
return - ENOMEM ;
2020-05-01 16:35:23 -07:00
if ( ! strcmp ( name , " adaptive " ) ) {
2018-10-24 18:34:26 +08:00
if ( f2fs_sb_has_blkzoned ( sbi ) ) {
2019-06-18 17:48:42 +08:00
f2fs_warn ( sbi , " adaptive mode is not allowed with zoned block device feature " ) ;
2020-06-17 20:30:12 +08:00
kfree ( name ) ;
2016-10-28 17:45:04 +09:00
return - EINVAL ;
}
2020-02-14 17:44:12 +08:00
F2FS_OPTION ( sbi ) . fs_mode = FS_MODE_ADAPTIVE ;
2020-05-01 16:35:23 -07:00
} else if ( ! strcmp ( name , " lfs " ) ) {
2020-02-14 17:44:12 +08:00
F2FS_OPTION ( sbi ) . fs_mode = FS_MODE_LFS ;
2021-09-29 11:12:03 -07:00
} else if ( ! strcmp ( name , " fragment:segment " ) ) {
F2FS_OPTION ( sbi ) . fs_mode = FS_MODE_FRAGMENT_SEG ;
} else if ( ! strcmp ( name , " fragment:block " ) ) {
F2FS_OPTION ( sbi ) . fs_mode = FS_MODE_FRAGMENT_BLK ;
2016-06-03 19:29:38 -07:00
} else {
2020-06-17 20:30:12 +08:00
kfree ( name ) ;
2016-06-03 19:29:38 -07:00
return - EINVAL ;
}
2020-06-17 20:30:12 +08:00
kfree ( name ) ;
2016-06-03 19:29:38 -07:00
break ;
2016-12-21 17:09:19 -08:00
case Opt_io_size_bits :
if ( args - > from & & match_int ( args , & arg ) )
return - EINVAL ;
2021-03-11 12:01:37 +01:00
if ( arg < = 0 | | arg > __ilog2_u32 ( BIO_MAX_VECS ) ) {
2019-06-18 17:48:42 +08:00
f2fs_warn ( sbi , " Not support %d, larger than %d " ,
2021-03-11 12:01:37 +01:00
1 < < arg , BIO_MAX_VECS ) ;
2016-12-21 17:09:19 -08:00
return - EINVAL ;
}
2018-03-08 14:22:56 +08:00
F2FS_OPTION ( sbi ) . write_io_size_bits = arg ;
2016-12-21 17:09:19 -08:00
break ;
2018-09-12 13:32:52 +08:00
# ifdef CONFIG_F2FS_FAULT_INJECTION
2016-04-29 15:34:32 -07:00
case Opt_fault_injection :
if ( args - > from & & match_int ( args , & arg ) )
return - EINVAL ;
2018-08-08 17:36:41 +08:00
f2fs_build_fault_attr ( sbi , arg , F2FS_ALL_FAULT_TYPE ) ;
set_opt ( sbi , FAULT_INJECTION ) ;
break ;
2018-09-12 13:32:52 +08:00
2018-08-08 17:36:41 +08:00
case Opt_fault_type :
if ( args - > from & & match_int ( args , & arg ) )
return - EINVAL ;
f2fs_build_fault_attr ( sbi , 0 , arg ) ;
2017-01-27 09:35:37 +08:00
set_opt ( sbi , FAULT_INJECTION ) ;
2018-09-12 13:32:52 +08:00
break ;
2016-04-29 15:34:32 -07:00
# else
2018-09-12 13:32:52 +08:00
case Opt_fault_injection :
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " fault_injection options not supported " ) ;
2016-04-29 15:34:32 -07:00
break ;
2018-09-12 13:32:52 +08:00
case Opt_fault_type :
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " fault_type options not supported " ) ;
2018-09-12 13:32:52 +08:00
break ;
# endif
2016-05-20 21:47:24 -07:00
case Opt_lazytime :
2017-11-27 13:05:09 -08:00
sb - > s_flags | = SB_LAZYTIME ;
2016-05-20 21:47:24 -07:00
break ;
case Opt_nolazytime :
2017-11-27 13:05:09 -08:00
sb - > s_flags & = ~ SB_LAZYTIME ;
2016-05-20 21:47:24 -07:00
break ;
2017-07-09 00:13:07 +08:00
# ifdef CONFIG_QUOTA
2017-08-08 10:54:31 +08:00
case Opt_quota :
2017-07-09 00:13:07 +08:00
case Opt_usrquota :
set_opt ( sbi , USRQUOTA ) ;
break ;
case Opt_grpquota :
set_opt ( sbi , GRPQUOTA ) ;
break ;
2017-07-26 00:01:41 +08:00
case Opt_prjquota :
set_opt ( sbi , PRJQUOTA ) ;
break ;
2017-08-08 10:54:31 +08:00
case Opt_usrjquota :
ret = f2fs_set_qf_name ( sb , USRQUOTA , & args [ 0 ] ) ;
if ( ret )
return ret ;
break ;
case Opt_grpjquota :
ret = f2fs_set_qf_name ( sb , GRPQUOTA , & args [ 0 ] ) ;
if ( ret )
return ret ;
break ;
case Opt_prjjquota :
ret = f2fs_set_qf_name ( sb , PRJQUOTA , & args [ 0 ] ) ;
if ( ret )
return ret ;
break ;
case Opt_offusrjquota :
ret = f2fs_clear_qf_name ( sb , USRQUOTA ) ;
if ( ret )
return ret ;
break ;
case Opt_offgrpjquota :
ret = f2fs_clear_qf_name ( sb , GRPQUOTA ) ;
if ( ret )
return ret ;
break ;
case Opt_offprjjquota :
ret = f2fs_clear_qf_name ( sb , PRJQUOTA ) ;
if ( ret )
return ret ;
break ;
case Opt_jqfmt_vfsold :
2018-03-08 14:22:56 +08:00
F2FS_OPTION ( sbi ) . s_jquota_fmt = QFMT_VFS_OLD ;
2017-08-08 10:54:31 +08:00
break ;
case Opt_jqfmt_vfsv0 :
2018-03-08 14:22:56 +08:00
F2FS_OPTION ( sbi ) . s_jquota_fmt = QFMT_VFS_V0 ;
2017-08-08 10:54:31 +08:00
break ;
case Opt_jqfmt_vfsv1 :
2018-03-08 14:22:56 +08:00
F2FS_OPTION ( sbi ) . s_jquota_fmt = QFMT_VFS_V1 ;
2017-08-08 10:54:31 +08:00
break ;
case Opt_noquota :
clear_opt ( sbi , QUOTA ) ;
clear_opt ( sbi , USRQUOTA ) ;
clear_opt ( sbi , GRPQUOTA ) ;
clear_opt ( sbi , PRJQUOTA ) ;
break ;
2017-07-09 00:13:07 +08:00
# else
2017-08-08 10:54:31 +08:00
case Opt_quota :
2017-07-09 00:13:07 +08:00
case Opt_usrquota :
case Opt_grpquota :
2017-07-26 00:01:41 +08:00
case Opt_prjquota :
2017-08-08 10:54:31 +08:00
case Opt_usrjquota :
case Opt_grpjquota :
case Opt_prjjquota :
case Opt_offusrjquota :
case Opt_offgrpjquota :
case Opt_offprjjquota :
case Opt_jqfmt_vfsold :
case Opt_jqfmt_vfsv0 :
case Opt_jqfmt_vfsv1 :
case Opt_noquota :
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " quota operations not supported " ) ;
2017-07-09 00:13:07 +08:00
break ;
# endif
2018-02-18 08:50:49 -08:00
case Opt_alloc :
name = match_strdup ( & args [ 0 ] ) ;
if ( ! name )
return - ENOMEM ;
2020-05-01 16:35:23 -07:00
if ( ! strcmp ( name , " default " ) ) {
2018-03-08 14:22:56 +08:00
F2FS_OPTION ( sbi ) . alloc_mode = ALLOC_MODE_DEFAULT ;
2020-05-01 16:35:23 -07:00
} else if ( ! strcmp ( name , " reuse " ) ) {
2018-03-08 14:22:56 +08:00
F2FS_OPTION ( sbi ) . alloc_mode = ALLOC_MODE_REUSE ;
2018-02-18 08:50:49 -08:00
} else {
2020-06-17 20:30:12 +08:00
kfree ( name ) ;
2018-02-18 08:50:49 -08:00
return - EINVAL ;
}
2020-06-17 20:30:12 +08:00
kfree ( name ) ;
2018-02-18 08:50:49 -08:00
break ;
2018-03-07 12:07:49 +08:00
case Opt_fsync :
name = match_strdup ( & args [ 0 ] ) ;
if ( ! name )
return - ENOMEM ;
2020-05-01 16:35:23 -07:00
if ( ! strcmp ( name , " posix " ) ) {
2018-03-08 14:22:56 +08:00
F2FS_OPTION ( sbi ) . fsync_mode = FSYNC_MODE_POSIX ;
2020-05-01 16:35:23 -07:00
} else if ( ! strcmp ( name , " strict " ) ) {
2018-03-08 14:22:56 +08:00
F2FS_OPTION ( sbi ) . fsync_mode = FSYNC_MODE_STRICT ;
2020-05-01 16:35:23 -07:00
} else if ( ! strcmp ( name , " nobarrier " ) ) {
2018-05-25 18:02:58 -07:00
F2FS_OPTION ( sbi ) . fsync_mode =
FSYNC_MODE_NOBARRIER ;
2018-03-07 12:07:49 +08:00
} else {
2020-06-17 20:30:12 +08:00
kfree ( name ) ;
2018-03-07 12:07:49 +08:00
return - EINVAL ;
}
2020-06-17 20:30:12 +08:00
kfree ( name ) ;
2018-03-07 12:07:49 +08:00
break ;
2018-03-15 18:51:42 +08:00
case Opt_test_dummy_encryption :
2020-05-12 16:32:50 -07:00
ret = f2fs_set_test_dummy_encryption ( sb , p , & args [ 0 ] ,
is_remount ) ;
if ( ret )
return ret ;
2018-03-15 18:51:42 +08:00
break ;
2020-07-02 01:56:06 +00:00
case Opt_inlinecrypt :
# ifdef CONFIG_FS_ENCRYPTION_INLINE_CRYPT
sb - > s_flags | = SB_INLINECRYPT ;
# else
f2fs_info ( sbi , " inline encryption not supported " ) ;
# endif
break ;
2019-05-29 17:49:06 -07:00
case Opt_checkpoint_disable_cap_perc :
if ( args - > from & & match_int ( args , & arg ) )
2018-08-20 19:21:43 -07:00
return - EINVAL ;
2019-05-29 17:49:06 -07:00
if ( arg < 0 | | arg > 100 )
return - EINVAL ;
2020-05-15 17:20:50 -07:00
F2FS_OPTION ( sbi ) . unusable_cap_perc = arg ;
2019-05-29 17:49:06 -07:00
set_opt ( sbi , DISABLE_CHECKPOINT ) ;
break ;
case Opt_checkpoint_disable_cap :
if ( args - > from & & match_int ( args , & arg ) )
return - EINVAL ;
F2FS_OPTION ( sbi ) . unusable_cap = arg ;
set_opt ( sbi , DISABLE_CHECKPOINT ) ;
break ;
case Opt_checkpoint_disable :
set_opt ( sbi , DISABLE_CHECKPOINT ) ;
break ;
case Opt_checkpoint_enable :
clear_opt ( sbi , DISABLE_CHECKPOINT ) ;
2018-08-20 19:21:43 -07:00
break ;
f2fs: introduce checkpoint_merge mount option
We've added two new mount options, "checkpoint_merge" and "nocheckpoint_merge",
which control a kernel daemon that merges concurrent checkpoint requests
as much as possible to eliminate redundant checkpoint issues. Plus,
we can eliminate the sluggishness caused by a slow checkpoint operation
when the checkpoint is done in a process context in a cgroup having
a low i/o budget and cpu shares. To improve on this, we set the
default i/o priority of the kernel daemon to "3", giving it a priority
one level higher than other kernel threads. The verification result below
explains this.
The basic idea has come from https://opensource.samsung.com.
[Verification]
Android Pixel Device(ARM64, 7GB RAM, 256GB UFS)
Create two I/O cgroups (fg w/ weight 100, bg w/ weight 20)
Set "strict_guarantees" to "1" in BFQ tunables
In "fg" cgroup,
- thread A => trigger 1000 checkpoint operations
"for i in `seq 1 1000`; do touch test_dir1/file; fsync test_dir1;
done"
- thread B => generating async. I/O
"fio --rw=write --numjobs=1 --bs=128k --runtime=3600 --time_based=1
--filename=test_img --name=test"
In "bg" cgroup,
- thread C => trigger repeated checkpoint operations
"echo $$ > /dev/blkio/bg/tasks; while true; do touch test_dir2/file;
fsync test_dir2; done"
We've measured thread A's execution time.
[ w/o patch ]
Elapsed Time: Avg. 68 seconds
[ w/ patch ]
Elapsed Time: Avg. 48 seconds
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
[Jaegeuk Kim: fix the return value in f2fs_start_ckpt_thread, reported by Dan]
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-01-19 09:00:42 +09:00
case Opt_checkpoint_merge :
set_opt ( sbi , MERGE_CHECKPOINT ) ;
break ;
case Opt_nocheckpoint_merge :
clear_opt ( sbi , MERGE_CHECKPOINT ) ;
break ;
2020-07-29 21:21:36 +08:00
# ifdef CONFIG_F2FS_FS_COMPRESSION
f2fs: support data compression
This patch tries to support compression in f2fs.
- A new term, cluster, is defined as the basic unit of compression; a file can
be logically divided into multiple clusters. One cluster includes 4 << n
(n >= 0) logical pages, the compression size is also the cluster size, and
each cluster can be compressed or not.
- In the cluster metadata layout, one special flag indicates whether a cluster
is a compressed one or a normal one; for a compressed cluster, the following
metadata maps the cluster to [1, 4 << n - 1] physical blocks, in which f2fs
stores data including the compress header and compressed data.
- In order to eliminate write amplification during overwrite, F2FS only
supports compression on write-once files; data can be compressed only when
all logical blocks in the file are valid and the cluster compress ratio is
lower than the specified threshold.
- To enable compression on regular inode, there are three ways:
* chattr +c file
* chattr +c dir; touch dir/file
* mount w/ -o compress_extension=ext; touch file.ext
Compress metadata layout:
                                  [Dnode Structure]
                  +-----------------------------------------------+
                  | cluster 1 | cluster 2 | ......... | cluster N |
                  +-----------------------------------------------+
                  .           .                       .           .
         .                          .            .                         .
.           Compressed Cluster           .  .            Normal Cluster             .
+----------+---------+---------+---------+  +---------+---------+---------+---------+
|compr flag| block 1 | block 2 | block 3 |  | block 1 | block 2 | block 3 | block 4 |
+----------+---------+---------+---------+  +---------+---------+---------+---------+
.                        .
.                                       .
.                                                     .
+-------------+-------------+----------+----------------------------+
| data length | data chksum | reserved |      compressed data       |
+-------------+-------------+----------+----------------------------+
Changelog:
20190326:
- fix error handling of read_end_io().
- remove unneeded comments in f2fs_encrypt_one_page().
20190327:
- fix wrong use of f2fs_cluster_is_full() in f2fs_mpage_readpages().
- don't jump into loop directly to avoid uninitialized variables.
- add TODO tag in error path of f2fs_write_cache_pages().
20190328:
- fix wrong merge condition in f2fs_read_multi_pages().
- check compressed file in f2fs_post_read_required().
20190401
- allow overwrite on non-compressed cluster.
- check cluster meta before writing compressed data.
20190402
- don't preallocate blocks for compressed file.
- add lz4 compress algorithm
- process multiple post read works in one workqueue
Now f2fs supports processing post read work in multiple workqueue,
it shows low performance due to schedule overhead of multiple
workqueue executing orderly.
20190921
- compress: support buffered overwrite
C: compress cluster flag
V: valid block address
N: NEW_ADDR
One cluster contain 4 blocks
before overwrite after overwrite
- VVVV -> CVNN
- CVNN -> VVVV
- CVNN -> CVNN
- CVNN -> CVVV
- CVVV -> CVNN
- CVVV -> CVVV
20191029
- add kconfig F2FS_FS_COMPRESSION to isolate compression related
codes, add kconfig F2FS_FS_{LZO,LZ4} to cover backend algorithm.
note that: will remove lzo backend if Jaegeuk agreed that too.
- update codes according to Eric's comments.
20191101
- apply fixes from Jaegeuk
20191113
- apply fixes from Jaegeuk
- split workqueue for fsverity
20191216
- apply fixes from Jaegeuk
20200117
- fix to avoid NULL pointer dereference
[Jaegeuk Kim]
- add tracepoint for f2fs_{,de}compress_pages()
- fix many bugs and add some compression stats
- fix overwrite/mmap bugs
- address 32bit build error, reported by Geert.
- bug fixes when handling errors and i_compressed_blocks
Reported-by: <noreply@ellerman.id.au>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-11-01 18:07:14 +08:00
case Opt_compress_algorithm :
if ( ! f2fs_sb_has_compression ( sbi ) ) {
2020-09-03 10:14:41 +08:00
f2fs_info ( sbi , " Image doesn't support compression " ) ;
break ;
2019-11-01 18:07:14 +08:00
}
name = match_strdup ( & args [ 0 ] ) ;
if ( ! name )
return - ENOMEM ;
2020-05-01 16:35:23 -07:00
if ( ! strcmp ( name , " lzo " ) ) {
2021-01-22 17:40:13 +08:00
# ifdef CONFIG_F2FS_FS_LZO
2021-01-22 17:46:43 +08:00
F2FS_OPTION ( sbi ) . compress_level = 0 ;
2019-11-01 18:07:14 +08:00
F2FS_OPTION ( sbi ) . compress_algorithm =
COMPRESS_LZO ;
2021-01-22 17:40:13 +08:00
# else
f2fs_info ( sbi , " kernel doesn't support lzo compression " ) ;
# endif
2021-01-22 17:46:43 +08:00
} else if ( ! strncmp ( name , " lz4 " , 3 ) ) {
2021-01-22 17:40:13 +08:00
# ifdef CONFIG_F2FS_FS_LZ4
2021-01-22 17:46:43 +08:00
ret = f2fs_set_lz4hc_level ( sbi , name ) ;
if ( ret ) {
kfree ( name ) ;
return - EINVAL ;
}
2019-11-01 18:07:14 +08:00
F2FS_OPTION ( sbi ) . compress_algorithm =
COMPRESS_LZ4 ;
2021-01-22 17:40:13 +08:00
# else
f2fs_info ( sbi , " kernel doesn't support lz4 compression " ) ;
# endif
2021-01-22 17:46:43 +08:00
} else if ( ! strncmp ( name , " zstd " , 4 ) ) {
2021-01-22 17:40:13 +08:00
# ifdef CONFIG_F2FS_FS_ZSTD
2021-01-22 17:46:43 +08:00
ret = f2fs_set_zstd_level ( sbi , name ) ;
if ( ret ) {
kfree ( name ) ;
return - EINVAL ;
}
2020-03-03 17:46:02 +08:00
F2FS_OPTION ( sbi ) . compress_algorithm =
COMPRESS_ZSTD ;
2021-01-22 17:40:13 +08:00
# else
f2fs_info ( sbi , " kernel doesn't support zstd compression " ) ;
# endif
2020-04-08 19:56:32 +08:00
} else if ( ! strcmp ( name , " lzo-rle " ) ) {
2021-01-22 17:40:13 +08:00
# ifdef CONFIG_F2FS_FS_LZORLE
2021-01-22 17:46:43 +08:00
F2FS_OPTION ( sbi ) . compress_level = 0 ;
2020-04-08 19:56:32 +08:00
F2FS_OPTION ( sbi ) . compress_algorithm =
COMPRESS_LZORLE ;
2021-01-22 17:40:13 +08:00
# else
f2fs_info ( sbi , " kernel doesn't support lzorle compression " ) ;
# endif
2019-11-01 18:07:14 +08:00
} else {
kfree ( name ) ;
return - EINVAL ;
}
kfree ( name ) ;
break ;
case Opt_compress_log_size :
if ( ! f2fs_sb_has_compression ( sbi ) ) {
2020-09-03 10:14:41 +08:00
f2fs_info ( sbi , " Image doesn't support compression " ) ;
break ;
2019-11-01 18:07:14 +08:00
}
if ( args - > from & & match_int ( args , & arg ) )
return - EINVAL ;
if ( arg < MIN_COMPRESS_LOG_SIZE | |
arg > MAX_COMPRESS_LOG_SIZE ) {
f2fs_err ( sbi ,
" Compress cluster log size is out of range " ) ;
return - EINVAL ;
}
F2FS_OPTION ( sbi ) . compress_log_size = arg ;
break ;
case Opt_compress_extension :
if ( ! f2fs_sb_has_compression ( sbi ) ) {
2020-09-03 10:14:41 +08:00
f2fs_info ( sbi , " Image doesn't support compression " ) ;
break ;
2019-11-01 18:07:14 +08:00
}
			name = match_strdup(&args[0]);
			if (!name)
				return -ENOMEM;
			ext = F2FS_OPTION(sbi).extensions;
			ext_cnt = F2FS_OPTION(sbi).compress_ext_cnt;
			if (strlen(name) >= F2FS_EXTENSION_LEN ||
				ext_cnt >= COMPRESS_EXT_NUM) {
				f2fs_err(sbi,
					"invalid extension length/number");
				kfree(name);
				return -EINVAL;
			}
			strcpy(ext[ext_cnt], name);
			F2FS_OPTION(sbi).compress_ext_cnt++;
			kfree(name);
			break;
2021-06-08 19:15:08 +08:00
		case Opt_nocompress_extension:
			if (!f2fs_sb_has_compression(sbi)) {
				f2fs_info(sbi, "Image doesn't support compression");
				break;
			}
			name = match_strdup(&args[0]);
			if (!name)
				return -ENOMEM;
			noext = F2FS_OPTION(sbi).noextensions;
			noext_cnt = F2FS_OPTION(sbi).nocompress_ext_cnt;
			if (strlen(name) >= F2FS_EXTENSION_LEN ||
				noext_cnt >= COMPRESS_EXT_NUM) {
				f2fs_err(sbi,
					"invalid extension length/number");
				kfree(name);
				return -EINVAL;
			}
			strcpy(noext[noext_cnt], name);
			F2FS_OPTION(sbi).nocompress_ext_cnt++;
			kfree(name);
			break;
2020-11-26 18:32:09 +08:00
		case Opt_compress_chksum:
			F2FS_OPTION(sbi).compress_chksum = true;
			break;
2020-12-01 13:08:02 +09:00
		case Opt_compress_mode:
			name = match_strdup(&args[0]);
			if (!name)
				return -ENOMEM;
			if (!strcmp(name, "fs")) {
				F2FS_OPTION(sbi).compress_mode = COMPR_MODE_FS;
			} else if (!strcmp(name, "user")) {
				F2FS_OPTION(sbi).compress_mode = COMPR_MODE_USER;
			} else {
				kfree(name);
				return -EINVAL;
			}
			kfree(name);
			break;
2021-05-20 19:51:50 +08:00
		case Opt_compress_cache:
			set_opt(sbi, COMPRESS_CACHE);
			break;
2020-07-29 21:21:36 +08:00
# else
case Opt_compress_algorithm :
case Opt_compress_log_size :
case Opt_compress_extension :
2021-06-08 19:15:08 +08:00
case Opt_nocompress_extension :
2020-11-26 18:32:09 +08:00
case Opt_compress_chksum :
2020-12-01 13:08:02 +09:00
case Opt_compress_mode :
2021-05-20 19:51:50 +08:00
case Opt_compress_cache :
2020-07-29 21:21:36 +08:00
			f2fs_info(sbi, "compression options not supported");
break ;
# endif
f2fs: support age threshold based garbage collection
There are several issues in the current background GC algorithm:
- The number of valid blocks is a key factor in the cost calculation,
so if a segment has few valid blocks, the CB algorithm will still
choose it as a victim even if it is young or is located in a hot
segment, which is not appropriate.
- GCed data/node blocks go to the existing logs regardless of whether
the data already there has the same update frequency, so hot and cold
data may be mixed again.
- The GC allocator mainly uses LFS-type segments, so it consumes free
segments more quickly.
This patch introduces a new algorithm named age threshold based
garbage collection to solve the above issues. There are three main
steps:
1. select a source victim:
- set an age threshold, and select candidates based on the threshold:
e.g.
0 means youngest, 100 means oldest; if we set the age threshold to 80
then select dirty segments whose age is in the range [80, 100] as
candidates;
- set a candidate_ratio threshold, and select candidates based on the
ratio, so that we can shrink candidates to the oldest segments;
- select the target segment with the fewest valid blocks in order to
migrate blocks with minimum cost;
2. select a target victim:
- select candidates based on the age threshold;
- set a candidate_radius threshold, and search candidates whose age is
around the source victim's; the searching radius should be less than
the radius threshold.
- select the target segment with the most valid blocks in order to
avoid migrating the current target segment.
3. merge valid blocks from the source victim into the target victim
with the SSR allocator.
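Step 1 above (source victim selection) can be sketched in plain C. This is a simplified illustration, not the kernel implementation: it applies only the age threshold and the fewest-valid-blocks rule and ignores candidate_ratio. The type `seg_info` and the function `select_source_victim` are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical per-segment summary used only for this sketch. */
struct seg_info {
	unsigned int age;          /* 0 = youngest, 100 = oldest */
	unsigned int valid_blocks; /* valid blocks still in the segment */
	int dirty;                 /* non-zero if the segment is dirty */
};

/*
 * Among dirty segments whose age is at least `age_threshold`, pick the
 * one with the fewest valid blocks. Returns its index, or -1 if no
 * segment qualifies.
 */
int select_source_victim(const struct seg_info *segs, size_t n,
			 unsigned int age_threshold)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!segs[i].dirty || segs[i].age < age_threshold)
			continue;
		if (best < 0 ||
		    segs[i].valid_blocks < segs[(size_t)best].valid_blocks)
			best = (int)i;
	}
	return best;
}
```

Target victim selection (step 2) would follow the same shape with the comparison reversed to prefer the most valid blocks, plus the radius filter around the source victim's age.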
Test steps:
- create 160 dirty segments:
* half of them have 128 valid blocks per segment
* the rest have 384 valid blocks per segment
- run background GC
Benefit: GC count and block movement count both decrease significantly:
- Before:
- Valid: 86
- Dirty: 1
- Prefree: 11
- Free: 6001 (6001)
GC calls: 162 (BG: 220)
- data segments : 160 (160)
- node segments : 2 (2)
Try to move 41454 blocks (BG: 41454)
- data blocks : 40960 (40960)
- node blocks : 494 (494)
IPU: 0 blocks
SSR: 0 blocks in 0 segments
LFS: 41364 blocks in 81 segments
- After:
- Valid: 87
- Dirty: 0
- Prefree: 4
- Free: 6008 (6008)
GC calls: 75 (BG: 76)
- data segments : 74 (74)
- node segments : 1 (1)
Try to move 12813 blocks (BG: 12813)
- data blocks : 12544 (12544)
- node blocks : 269 (269)
IPU: 0 blocks
SSR: 12032 blocks in 77 segments
LFS: 855 blocks in 2 segments
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix a bug along with pinfile in-mem segment & clean up]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-08-04 21:14:49 +08:00
		case Opt_atgc:
			set_opt(sbi, ATGC);
			break;
2021-03-27 17:57:06 +08:00
		case Opt_gc_merge:
			set_opt(sbi, GC_MERGE);
			break;
		case Opt_nogc_merge:
			clear_opt(sbi, GC_MERGE);
			break;
f2fs: introduce discard_unit mount option
As James Z reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=213877
[1.] One-line summary of the problem:
Mount multiple SMR block devices exceed certain number cause system non-response
[2.] Full description of the problem/report:
Created some F2FS on SMR devices (mkfs.f2fs -m), then mounted in sequence. Each device is the same Model: HGST HSH721414AL (Size 14TB).
Empirically, found that when the amount of SMR device * 1.5Gb > System RAM, the system ran out of memory and hung. No dmesg output. For example, 24 SMR Disk need 24*1.5GB = 36GB. A system with 32G RAM can only mount 21 devices, the 22nd device will be a reproducible cause of system hang.
The number of SMR devices with other FS mounted on this system does not interfere with the result above.
[3.] Keywords (i.e., modules, networking, kernel):
F2FS, SMR, Memory
[4.] Kernel information
[4.1.] Kernel version (uname -a):
Linux 5.13.4-200.fc34.x86_64 #1 SMP Tue Jul 20 20:27:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[4.2.] Kernel .config file:
Default Fedora 34 with f2fs-tools-1.14.0-2.fc34.x86_64
[5.] Most recent kernel version which did not have the bug:
None
[6.] Output of Oops.. message (if applicable) with symbolic information
resolved (see Documentation/admin-guide/oops-tracing.rst)
None
[7.] A small shell script or example program which triggers the
problem (if possible)
mount /dev/sdX /mnt/0X
[8.] Memory consumption
With 24 * 14T SMR Block device with F2FS
free -g
total used free shared buff/cache available
Mem: 46 36 0 0 10 10
Swap: 0 0 0
With 3 * 14T SMR Block device with F2FS
free -g
total used free shared buff/cache available
Mem: 7 5 0 0 1 1
Swap: 7 0 7
The root cause is that there are three bitmaps:
- cur_valid_map
- ckpt_valid_map
- discard_map
and each of them costs ~500MB of memory. {cur, ckpt}_valid_map are
necessary, but discard_map is optional, since this bitmap is only
useful on a mountpoint where small discard is enabled.
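The per-bitmap cost can be estimated with a short calculation. The sketch below assumes one bit per 4KB block, which is an assumption chosen because it is consistent with the ~500MB figure above; `block_bitmap_bytes` is a hypothetical helper name.

```c
#include <stdint.h>

/* Bytes needed for a bitmap with one bit per block (assumed layout). */
uint64_t block_bitmap_bytes(uint64_t device_bytes, uint64_t block_size)
{
	uint64_t blocks;

	if (block_size == 0)
		return 0;
	blocks = device_bytes / block_size;
	return (blocks + 7) / 8; /* round up to whole bytes */
}
```

Treating a 14T device as 14 TiB with 4096-byte blocks, this gives 469762048 bytes (448 MiB) per bitmap, so three bitmaps come to roughly 1.3 GiB per device, in line with the reported ~1.5GB per mounted disk.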
For a blkzoned device such as an SMR or ZNS device, f2fs only issues
discard for a section (zone) when all blocks of that section are
invalid, so such a device does not need small discard functionality at
all.
This patch introduces a new mount option "discard_unit=block|segment|
section" to support issuing discard with a different basic unit, aligned
to block, segment or section, so that the user can specify
"discard_unit=segment" or "discard_unit=section" to disable small
discard functionality.
Note that this mount option cannot be changed by remount() because the
related metadata needs to be initialized during mount().
In order to save memory, use "discard_unit=section" for blkzoned
devices by default.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-03 08:15:43 +08:00
		case Opt_discard_unit:
			name = match_strdup(&args[0]);
			if (!name)
				return -ENOMEM;
			if (!strcmp(name, "block")) {
				F2FS_OPTION(sbi).discard_unit =
						DISCARD_UNIT_BLOCK;
			} else if (!strcmp(name, "segment")) {
				F2FS_OPTION(sbi).discard_unit =
						DISCARD_UNIT_SEGMENT;
			} else if (!strcmp(name, "section")) {
				F2FS_OPTION(sbi).discard_unit =
						DISCARD_UNIT_SECTION;
			} else {
				kfree(name);
				return -EINVAL;
			}
			kfree(name);
			break;
2022-06-20 10:38:42 -07:00
		case Opt_memory_mode:
			name = match_strdup(&args[0]);
			if (!name)
				return -ENOMEM;
			if (!strcmp(name, "normal")) {
				F2FS_OPTION(sbi).memory_mode =
						MEMORY_MODE_NORMAL;
			} else if (!strcmp(name, "low")) {
				F2FS_OPTION(sbi).memory_mode =
						MEMORY_MODE_LOW;
			} else {
				kfree(name);
				return -EINVAL;
			}
			kfree(name);
			break;
2022-12-01 17:37:15 -08:00
case Opt_age_extent_cache :
set_opt ( sbi , AGE_EXTENT_CACHE ) ;
break ;
2013-06-16 09:48:48 +09:00
default :
2019-06-18 17:48:42 +08:00
			f2fs_err(sbi, "Unrecognized mount option \"%s\" or missing value",
				 p);
2013-06-16 09:48:48 +09:00
return - EINVAL ;
}
}
2021-05-21 01:32:53 -07:00
default_check :
2017-08-08 10:54:31 +08:00
#ifdef CONFIG_QUOTA
	if (f2fs_check_quota_options(sbi))
		return -EINVAL;
2018-07-24 20:17:52 +08:00
#else
2018-10-24 18:34:26 +08:00
	if (f2fs_sb_has_quota_ino(sbi) && !f2fs_readonly(sbi->sb)) {
2019-06-18 17:48:42 +08:00
		f2fs_info(sbi, "Filesystem with quota feature cannot be mounted RDWR without CONFIG_QUOTA");
2018-07-24 20:17:52 +08:00
		return -EINVAL;
	}
2018-10-24 18:34:26 +08:00
	if (f2fs_sb_has_project_quota(sbi) && !f2fs_readonly(sbi->sb)) {
2019-06-18 17:48:42 +08:00
		f2fs_err(sbi, "Filesystem with project quota feature cannot be mounted RDWR without CONFIG_QUOTA");
2018-07-26 07:19:48 +08:00
		return -EINVAL;
	}
2017-08-08 10:54:31 +08:00
#endif
2022-01-18 07:56:14 +01:00
#if !IS_ENABLED(CONFIG_UNICODE)
2019-07-23 16:05:28 -07:00
	if (f2fs_sb_has_casefold(sbi)) {
		f2fs_err(sbi,
			"Filesystem with casefold feature cannot be mounted without CONFIG_UNICODE");
		return -EINVAL;
	}
#endif
2020-09-21 20:53:14 +08:00
	/*
	 * The BLKZONED feature indicates that the drive was formatted with
	 * zone alignment optimization. This is optional for host-aware
	 * devices, but mandatory for host-managed zoned block devices.
	 */
2021-08-03 08:15:43 +08:00
if ( f2fs_sb_has_blkzoned ( sbi ) ) {
2022-11-29 20:29:28 +08:00
# ifdef CONFIG_BLK_DEV_ZONED
2021-08-03 08:15:43 +08:00
		if (F2FS_OPTION(sbi).discard_unit !=
						DISCARD_UNIT_SECTION) {
			f2fs_info(sbi, "Zoned block device doesn't need small discard, set discard_unit=section by default");
			F2FS_OPTION(sbi).discard_unit =
						DISCARD_UNIT_SECTION;
		}
2022-11-29 20:29:28 +08:00
# else
f2fs_err ( sbi , " Zoned block device support is not enabled " ) ;
return - EINVAL ;
# endif
2021-08-03 08:15:43 +08:00
}
2016-12-21 17:09:19 -08:00
2021-06-08 19:15:08 +08:00
#ifdef CONFIG_F2FS_FS_COMPRESSION
	if (f2fs_test_compress_extension(sbi)) {
		f2fs_err(sbi, "invalid compress or nocompress extension");
		return -EINVAL;
	}
#endif
2020-02-14 17:44:12 +08:00
	if (F2FS_IO_SIZE_BITS(sbi) && !f2fs_lfs_mode(sbi)) {
2019-06-18 17:48:42 +08:00
		f2fs_err(sbi, "Should set mode=lfs with %uKB-sized IO",
			 F2FS_IO_SIZE_KB(sbi));
2016-12-21 17:09:19 -08:00
		return -EINVAL;
	}
f2fs: support flexible inline xattr size
In products, more and more features based on file encryption are being
introduced, and their demand for xattr space is increasing. However,
inline xattr has a fixed size of 200 bytes; once the inline xattr space
is full, additional xattr data occupies an extra xattr block, which may
cost more space and cause a performance regression when persisting.
To resolve this issue, it is better to expand the inline xattr size
flexibly according to the user's requirement.
So this patch introduces a new filesystem feature 'flexible inline
xattr' and a new mount option 'inline_xattr_size=%u'. Once mkfs enables
the feature, the option can be used to make f2fs support a flexible
inline xattr size.
To support this feature, we add an extra attribute i_inline_xattr_size
to the inode layout, indicating how much space inline xattr borrows from
the block address mapping space in the inode layout. With this, we can
easily locate and store flexible-sized inline xattr data in the inode.
Inode disk layout:
+----------------------+
| .i_mode |
| ... |
| .i_ext |
+----------------------+
| .i_extra_isize |
| .i_inline_xattr_size |-----------+
| ... | |
+----------------------+ |
| .i_addr | |
| - block address or | |
| - inline data | |
+----------------------+<---+ v
| inline xattr | +---inline xattr range
+----------------------+<---+
| .i_nid |
+----------------------+
| node_footer |
| (nid, ino, offset) |
+----------------------+
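The borrowing shown in the layout above comes down to simple slot arithmetic: whatever the extra attribute area and the inline xattr area take is no longer available to i_addr. The helper below is a sketch under the assumption that all three quantities are counted in 4-byte address slots; `remaining_addr_slots` is a hypothetical name, not a kernel function.

```c
/*
 * Address slots left for block mapping once the extra attribute area and
 * the inline xattr area are carved out. All counts are in 4-byte slots.
 * Returns -1 if the arguments are negative or the areas do not fit.
 */
int remaining_addr_slots(int total_slots, int extra_slots,
			 int inline_xattr_slots)
{
	if (total_slots < 0 || extra_slots < 0 || inline_xattr_slots < 0)
		return -1;
	if (extra_slots + inline_xattr_slots > total_slots)
		return -1;
	return total_slots - extra_slots - inline_xattr_slots;
}
```

A larger inline_xattr_size therefore trades direct block addresses (or inline data space) for xattr space inside the same inode block.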
Note that we have to consider backward compatibility: inline_data space
of 200 bytes was reserved all the time, as reported by Sheng Yong.
Previously, inline data or directory always reserved 200 bytes in the inode
layout, even if inline_xattr was disabled. In order to keep inline_dentry's
structure for backward compatibility, we get the space back only from inline_data.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reported-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-06 21:59:50 +08:00
	if (test_opt(sbi, INLINE_XATTR_SIZE)) {
2019-03-12 11:49:53 -07:00
		int min_size, max_size;
2018-10-24 18:34:26 +08:00
		if (!f2fs_sb_has_extra_attr(sbi) ||
			!f2fs_sb_has_flexible_inline_xattr(sbi)) {
2019-06-18 17:48:42 +08:00
			f2fs_err(sbi, "extra_attr or flexible_inline_xattr feature is off");
2018-01-27 17:29:48 +08:00
			return -EINVAL;
		}
2017-09-06 21:59:50 +08:00
		if (!test_opt(sbi, INLINE_XATTR)) {
2019-06-18 17:48:42 +08:00
			f2fs_err(sbi, "inline_xattr_size option should be set with inline_xattr option");
2017-09-06 21:59:50 +08:00
			return -EINVAL;
		}
2019-03-12 11:49:53 -07:00
		min_size = sizeof(struct f2fs_xattr_header) / sizeof(__le32);
2019-03-04 17:19:04 +08:00
		max_size = MAX_INLINE_XATTR_SIZE;
2019-03-12 11:49:53 -07:00
		if (F2FS_OPTION(sbi).inline_xattr_size < min_size ||
				F2FS_OPTION(sbi).inline_xattr_size > max_size) {
2019-06-18 17:48:42 +08:00
			f2fs_err(sbi, "inline xattr size is out of range: %d ~ %d",
				 min_size, max_size);
2017-09-06 21:59:50 +08:00
return - EINVAL ;
}
}
2018-01-31 11:36:57 +09:00
2020-02-14 17:44:12 +08:00
	if (test_opt(sbi, DISABLE_CHECKPOINT) && f2fs_lfs_mode(sbi)) {
2023-02-06 22:43:08 +08:00
		f2fs_err(sbi, "LFS is not compatible with checkpoint=disable");
2018-08-20 19:21:43 -07:00
		return -EINVAL;
	}
2022-08-11 15:53:34 -07:00
	if (test_opt(sbi, ATGC) && f2fs_lfs_mode(sbi)) {
2023-02-06 22:43:08 +08:00
		f2fs_err(sbi, "LFS is not compatible with ATGC");
2022-08-11 15:53:34 -07:00
		return -EINVAL;
	}
2022-11-24 10:48:42 +08:00
	if (f2fs_is_readonly(sbi) && test_opt(sbi, FLUSH_MERGE)) {
2022-11-10 17:15:01 +08:00
		f2fs_err(sbi, "FLUSH_MERGE not compatible with readonly mode");
		return -EINVAL;
	}
2021-05-21 01:32:53 -07:00
	if (f2fs_sb_has_readonly(sbi) && !f2fs_readonly(sbi->sb)) {
		f2fs_err(sbi, "Allow to mount readonly mode only");
		return -EROFS;
	}
2013-06-16 09:48:48 +09:00
	return 0;
}
2012-11-02 17:07:47 +09:00
static struct inode *f2fs_alloc_inode(struct super_block *sb)
{
	struct f2fs_inode_info *fi;
2022-12-21 02:39:04 +08:00
	if (time_to_inject(F2FS_SB(sb), FAULT_SLAB_ALLOC))
2022-03-22 14:41:06 -07:00
		return NULL;
	fi = alloc_inode_sb(sb, f2fs_inode_cachep, GFP_F2FS_ZERO);
2012-11-02 17:07:47 +09:00
	if (!fi)
		return NULL;
	init_once((void *)fi);
2013-03-19 08:03:35 +09:00
	/* Initialize f2fs-specific inode info */
2016-12-02 15:11:32 -08:00
	atomic_set(&fi->dirty_pages, 0);
2020-09-08 11:44:10 +09:00
	atomic_set(&fi->i_compr_blocks, 0);
2022-01-07 12:48:44 -08:00
	init_f2fs_rwsem(&fi->i_sem);
2020-02-27 19:30:03 +08:00
	spin_lock_init(&fi->i_size_lock);
2015-12-15 13:30:45 +08:00
	INIT_LIST_HEAD(&fi->dirty_list);
2016-05-20 11:10:10 -07:00
	INIT_LIST_HEAD(&fi->gdirty_list);
2022-01-07 12:48:44 -08:00
	init_f2fs_rwsem(&fi->i_gc_rwsem[READ]);
	init_f2fs_rwsem(&fi->i_gc_rwsem[WRITE]);
	init_f2fs_rwsem(&fi->i_xattr_sem);
2012-11-02 17:07:47 +09:00
2014-02-27 20:09:05 +09:00
	/* Will be used by directory only */
	fi->i_dir_level = F2FS_SB(sb)->dir_level;
2017-07-19 00:19:05 +08:00
2012-11-02 17:07:47 +09:00
	return &fi->vfs_inode;
}
2013-04-30 11:33:27 +09:00
static int f2fs_drop_inode(struct inode *inode)
{
2019-07-18 16:39:59 +08:00
	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
2017-02-27 13:02:58 +00:00
	int ret;
2019-07-18 16:39:59 +08:00
	/*
	 * during filesystem shutdown, if checkpoint is disabled,
	 * drop useless meta/node dirty pages.
	 */
	if (unlikely(is_sbi_flag_set(sbi, SBI_CP_DISABLED))) {
		if (inode->i_ino == F2FS_NODE_INO(sbi) ||
			inode->i_ino == F2FS_META_INO(sbi)) {
			trace_f2fs_drop_inode(inode, 1);
			return 1;
		}
	}
2013-04-30 11:33:27 +09:00
	/*
	 * This is to avoid a deadlock condition like below.
	 * writeback_single_inode(inode)
	 *  - f2fs_write_data_page
	 *    - f2fs_gc -> iput -> evict
	 *       - inode_wait_for_writeback(inode)
	 */
2016-05-20 11:10:10 -07:00
	if ((!inode_unhashed(inode) && inode->i_state & I_SYNC)) {
2015-05-13 14:35:14 -07:00
		if (!inode->i_nlink && !is_bad_inode(inode)) {
2015-06-19 17:53:26 -07:00
			/* to avoid evict_inode call simultaneously */
			atomic_inc(&inode->i_count);
2015-05-13 14:35:14 -07:00
			spin_unlock(&inode->i_lock);
2015-06-19 17:53:26 -07:00
			/* should remain fi->extent_tree for writepage */
			f2fs_destroy_extent_node(inode);
2015-05-13 14:35:14 -07:00
			sb_start_intwrite(inode->i_sb);
2016-05-20 09:22:03 -07:00
			f2fs_i_size_write(inode, 0);
2015-05-13 14:35:14 -07:00
2019-03-04 09:32:25 +08:00
			f2fs_submit_merged_write_cond(F2FS_I_SB(inode),
					inode, NULL, 0, DATA);
			truncate_inode_pages_final(inode->i_mapping);
2015-05-13 14:35:14 -07:00
			if (F2FS_HAS_BLOCKS(inode))
2016-06-02 13:49:38 -07:00
				f2fs_truncate(inode);
2015-05-13 14:35:14 -07:00
			sb_end_intwrite(inode->i_sb);
			spin_lock(&inode->i_lock);
2015-06-19 17:53:26 -07:00
			atomic_dec(&inode->i_count);
2015-05-13 14:35:14 -07:00
		}
2017-02-27 13:02:58 +00:00
		trace_f2fs_drop_inode(inode, 0);
2013-04-30 11:33:27 +09:00
		return 0;
2015-05-13 14:35:14 -07:00
	}
2017-02-27 13:02:58 +00:00
	ret = generic_drop_inode(inode);
2019-08-04 19:35:48 -07:00
	if (!ret)
		ret = fscrypt_drop_inode(inode);
2017-02-27 13:02:58 +00:00
	trace_f2fs_drop_inode(inode, ret);
	return ret;
2013-04-30 11:33:27 +09:00
}
2016-10-14 11:51:23 -07:00
int f2fs_inode_dirtied(struct inode *inode, bool sync)
2013-06-10 09:17:01 +09:00
{
2016-05-20 11:10:10 -07:00
	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
2016-10-14 11:51:23 -07:00
	int ret = 0;
2016-05-20 11:10:10 -07:00
	spin_lock(&sbi->inode_lock[DIRTY_META]);
	if (is_inode_flag_set(inode, FI_DIRTY_INODE)) {
2016-10-14 11:51:23 -07:00
		ret = 1;
	} else {
		set_inode_flag(inode, FI_DIRTY_INODE);
		stat_inc_dirty_inode(sbi, DIRTY_META);
2016-05-20 11:10:10 -07:00
	}
2016-10-14 11:51:23 -07:00
	if (sync && list_empty(&F2FS_I(inode)->gdirty_list)) {
		list_add_tail(&F2FS_I(inode)->gdirty_list,
2016-05-20 11:10:10 -07:00
				&sbi->inode_list[DIRTY_META]);
2016-10-14 11:51:23 -07:00
		inc_page_count(sbi, F2FS_DIRTY_IMETA);
	}
2016-06-02 11:08:56 -07:00
	spin_unlock(&sbi->inode_lock[DIRTY_META]);
2016-10-14 11:51:23 -07:00
	return ret;
2016-05-20 11:10:10 -07:00
}
void f2fs_inode_synced(struct inode *inode)
{
	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
	spin_lock(&sbi->inode_lock[DIRTY_META]);
	if (!is_inode_flag_set(inode, FI_DIRTY_INODE)) {
		spin_unlock(&sbi->inode_lock[DIRTY_META]);
		return;
	}
2016-10-14 11:51:23 -07:00
	if (!list_empty(&F2FS_I(inode)->gdirty_list)) {
		list_del_init(&F2FS_I(inode)->gdirty_list);
		dec_page_count(sbi, F2FS_DIRTY_IMETA);
	}
2016-05-20 11:10:10 -07:00
	clear_inode_flag(inode, FI_DIRTY_INODE);
2016-05-20 20:42:37 -07:00
	clear_inode_flag(inode, FI_AUTO_RECOVER);
2016-05-20 11:10:10 -07:00
	stat_dec_dirty_inode(F2FS_I_SB(inode), DIRTY_META);
2016-06-02 11:08:56 -07:00
	spin_unlock(&sbi->inode_lock[DIRTY_META]);
2013-06-10 09:17:01 +09:00
}
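The pairing of f2fs_inode_dirtied() and f2fs_inode_synced() above can be modeled in userspace roughly as follows. This is only a sketch under stated assumptions, not kernel code: `toy_sb`, `toy_inode`, `toy_dirtied()` and `toy_synced()` are hypothetical names, the spinlock is left out, and a boolean stands in for membership of gdirty_list.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the per-superblock counters. */
struct toy_sb {
	int dirty_inodes;	/* models stat_inc/dec_dirty_inode() */
	int dirty_imeta;	/* models the F2FS_DIRTY_IMETA page count */
};

struct toy_inode {
	bool dirty;		/* models FI_DIRTY_INODE */
	bool on_gdirty_list;	/* models !list_empty(&gdirty_list) */
};

/* Returns 1 if the inode was already dirty, 0 otherwise. */
static int toy_dirtied(struct toy_sb *sb, struct toy_inode *ino, bool sync)
{
	int ret = 0;

	if (ino->dirty) {
		ret = 1;
	} else {
		ino->dirty = true;
		sb->dirty_inodes++;
	}
	/* Only a sync request queues the inode, and only once. */
	if (sync && !ino->on_gdirty_list) {
		ino->on_gdirty_list = true;
		sb->dirty_imeta++;
	}
	return ret;
}

/* Undo both kinds of accounting; a clean inode is left untouched. */
static void toy_synced(struct toy_sb *sb, struct toy_inode *ino)
{
	if (!ino->dirty)
		return;
	if (ino->on_gdirty_list) {
		ino->on_gdirty_list = false;
		sb->dirty_imeta--;
	}
	ino->dirty = false;
	sb->dirty_inodes--;
}
```

In this model the dirty-inode counter moves only on a clean-to-dirty or dirty-to-clean transition, while the queue counter moves only when list membership changes, which is why repeated calls leave both counters balanced.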
2016-06-30 19:09:37 -07:00
/*
 * f2fs_dirty_inode() is called from __mark_inode_dirty()
 *
 * We should call set_dirty_inode to write the dirty inode through write_inode.
 */
static void f2fs_dirty_inode(struct inode *inode, int flags)
{
	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
	if (inode->i_ino == F2FS_NODE_INO(sbi) ||
			inode->i_ino == F2FS_META_INO(sbi))
		return;
	if (is_inode_flag_set(inode, FI_AUTO_RECOVER))
		clear_inode_flag(inode, FI_AUTO_RECOVER);
2016-10-14 11:51:23 -07:00
	f2fs_inode_dirtied(inode, false);
2016-06-30 19:09:37 -07:00
}
2019-04-15 19:29:14 -04:00
static void f2fs_free_inode(struct inode *inode)
2012-11-02 17:07:47 +09:00
{
2019-04-10 13:21:15 -07:00
	fscrypt_free_inode(inode);
2012-11-02 17:07:47 +09:00
	kmem_cache_free(f2fs_inode_cachep, F2FS_I(inode));
}
2016-05-13 12:36:58 -07:00
static void destroy_percpu_info(struct f2fs_sb_info *sbi)
{
2016-05-16 11:42:32 -07:00
	percpu_counter_destroy(&sbi->total_valid_inode_count);
2022-01-27 13:31:43 -08:00
	percpu_counter_destroy(&sbi->rf_node_block_count);
	percpu_counter_destroy(&sbi->alloc_valid_block_count);
2016-05-13 12:36:58 -07:00
}
2016-10-06 19:02:05 -07:00
static void destroy_device_list(struct f2fs_sb_info *sbi)
{
	int i;
	for (i = 0; i < sbi->s_ndevs; i++) {
		blkdev_put(FDEV(i).bdev, FMODE_EXCL);
#ifdef CONFIG_BLK_DEV_ZONED
2019-03-16 09:13:07 +09:00
		kvfree(FDEV(i).blkz_seq);
2016-10-06 19:02:05 -07:00
#endif
	}
2018-12-13 18:38:33 -08:00
	kvfree(sbi->devs);
2016-10-06 19:02:05 -07:00
}
2012-11-02 17:07:47 +09:00
static void f2fs_put_super(struct super_block *sb)
{
	struct f2fs_sb_info *sbi = F2FS_SB(sb);
2017-06-14 17:39:46 +08:00
	int i;
2023-01-13 03:14:04 +08:00
	bool done;
2012-11-02 17:07:47 +09:00
2020-07-24 09:38:11 +08:00
	/* unregister procfs/sysfs entries in advance to avoid race case */
	f2fs_unregister_sysfs(sbi);
2017-07-09 00:13:07 +08:00
	f2fs_quota_off_umount(sb);
2012-11-02 17:07:47 +09:00
2015-06-19 12:01:21 -07:00
	/* prevent remaining shrinker jobs */
	mutex_lock(&sbi->umount_mutex);
f2fs: introduce checkpoint_merge mount option
We've added new mount options, "checkpoint_merge" and "nocheckpoint_merge",
which create a kernel daemon and make it merge concurrent checkpoint
requests as much as possible to eliminate redundant checkpoint issues. Plus,
we can eliminate the sluggish issue caused by slow checkpoint operation
when the checkpoint is done in a process context in a cgroup having
low i/o budget and cpu shares. To make this do better, we set the
default i/o priority of the kernel daemon to "3", to give one higher
priority than other kernel threads. The below verification result
explains this.
The basic idea has come from https://opensource.samsung.com.
[Verification]
Android Pixel Device(ARM64, 7GB RAM, 256GB UFS)
Create two I/O cgroups (fg w/ weight 100, bg w/ weight 20)
Set "strict_guarantees" to "1" in BFQ tunables
In "fg" cgroup,
- thread A => trigger 1000 checkpoint operations
"for i in `seq 1 1000`; do touch test_dir1/file; fsync test_dir1;
done"
- thread B => generating async. I/O
"fio --rw=write --numjobs=1 --bs=128k --runtime=3600 --time_based=1
--filename=test_img --name=test"
In "bg" cgroup,
- thread C => trigger repeated checkpoint operations
"echo $$ > /dev/blkio/bg/tasks; while true; do touch test_dir2/file;
fsync test_dir2; done"
We've measured thread A's execution time.
[ w/o patch ]
Elapsed Time: Avg. 68 seconds
[ w/ patch ]
Elapsed Time: Avg. 48 seconds
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
[Jaegeuk Kim: fix the return value in f2fs_start_ckpt_thread, reported by Dan]
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
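The merging idea described in the message above can be illustrated with a small single-threaded model. This is a sketch under stated assumptions, not the f2fs implementation: `toy_ckpt`, `toy_request()` and `toy_daemon_pass()` are hypothetical names, and the kernel thread, its I/O priority, and all synchronization are left out.

```c
#include <assert.h>

/* Hypothetical model of a checkpoint daemon that merges requests. */
struct toy_ckpt {
	int pending;	/* requests queued since the last checkpoint */
	int issued;	/* checkpoints actually written */
	int completed;	/* requests that have been answered */
};

/* A caller asks for a checkpoint; nothing is written yet. */
static void toy_request(struct toy_ckpt *c)
{
	c->pending++;
}

/* One daemon pass: a single checkpoint answers every queued request. */
static void toy_daemon_pass(struct toy_ckpt *c)
{
	if (c->pending == 0)
		return;
	c->issued++;
	c->completed += c->pending;
	c->pending = 0;
}
```

With three requests queued before the daemon runs, the model writes one checkpoint and completes all three, which is the redundancy the mount option is meant to remove.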
2021-01-19 09:00:42 +09:00
	/*
	 * flush all issued checkpoints and stop checkpoint issue thread.
	 * after then, all checkpoints should be done by each process context.
	 */
	f2fs_stop_ckpt_thread(sbi);
2015-01-14 17:41:41 -08:00
	/*
	 * We don't need to do checkpoint when superblock is clean.
	 * But, the previous checkpoint was not done by umount, it needs to do
	 * clean checkpoint again.
	 */
2018-08-20 19:21:43 -07:00
	if ((is_sbi_flag_set(sbi, SBI_IS_DIRTY) ||
			!is_set_ckpt_flags(sbi, CP_UMOUNT_FLAG))) {
2014-09-20 21:57:51 -07:00
		struct cp_control cpc = {
			.reason = CP_UMOUNT,
		};
f2fs: clean up symbol namespace
As Ted reported:
"Hi, I was looking at f2fs's sources recently, and I noticed that there
is a very large number of non-static symbols which don't have a f2fs
prefix. There's well over a hundred (see attached below).
As one example, in fs/f2fs/dir.c there is:
unsigned char get_de_type(struct f2fs_dir_entry *de)
This function is clearly only useful for f2fs, but it has a generic
name. This means that if any other file system tries to have the same
symbol name, there will be a symbol conflict and the kernel would not
successfully build. It also means that when someone is looking f2fs
sources, it's not at all obvious whether a function such as
read_data_page(), invalidate_blocks(), is a generic kernel function
found in the fs, mm, or block layers, or a f2fs specific function.
You might want to fix this at some point. Hopefully Kent's bcachefs
isn't similarly using genericly named functions, since that might
cause conflicts with f2fs's functions --- but just as this would be a
problem that we would rightly insist that Kent fix, this is something
that we should have rightly insisted that f2fs should have fixed
before it was integrated into the mainline kernel.
acquire_orphan_inode
add_ino_entry
add_orphan_inode
allocate_data_block
allocate_new_segments
alloc_nid
alloc_nid_done
alloc_nid_failed
available_free_memory
...."
This patch adds "f2fs_" prefix for all non-static symbols in order to:
a) avoid conflict with other kernel generic symbols;
b) to indicate the function is f2fs specific one instead of generic
one;
Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 00:20:41 +08:00
		f2fs_write_checkpoint(sbi, &cpc);
2014-09-20 21:57:51 -07:00
	}
2012-11-02 17:07:47 +09:00
2016-12-29 14:07:53 -08:00
	/* be sure to wait for any on-going discard commands */
2023-01-13 03:14:04 +08:00
	done = f2fs_issue_discard_timeout(sbi);
	if (f2fs_realtime_discard_enable(sbi) && !sbi->discard_blks && done) {
2017-04-28 13:56:08 +08:00
		struct cp_control cpc = {
			.reason = CP_UMOUNT | CP_TRIMMED,
		};
2018-05-30 00:20:41 +08:00
		f2fs_write_checkpoint(sbi, &cpc);
2017-04-28 13:56:08 +08:00
	}
2014-08-11 18:37:46 -07:00
	/*
	 * normally superblock is clean, so we need to release this.
	 * In addition, EIO will skip do checkpoint, we need this as well.
	 */
2018-05-30 00:20:41 +08:00
	f2fs_release_ino_entry(sbi, true);
2014-08-19 09:48:22 -07:00
2015-06-19 12:01:21 -07:00
	f2fs_leave_shrinker(sbi);
	mutex_unlock(&sbi->umount_mutex);
2016-01-29 08:57:59 -08:00
	/* our cp_error case, we can wait for any writeback page */
2017-05-10 11:28:38 -07:00
	f2fs_flush_merged_writes(sbi);
2016-01-29 08:57:59 -08:00
2020-02-18 09:19:07 +05:30
	f2fs_wait_on_all_pages(sbi, F2FS_WB_CP_DATA);
f2fs: fix to avoid broken of dnode block list
f2fs recovery flow is relying on dnode block link list, it means fsynced
file recovery depends on previous dnode's persistence in the list, so
during fsync() we should wait on all regular inode's dnode writebacked
before issuing flush.
By this way, we can avoid dnode block list being broken by out-of-order
IO submission due to IO scheduler or driver.
Sheng Yong helps to do the test with this patch:
Target:/data (f2fs, -)
64MB / 32768KB / 4KB / 8
1 / PERSIST / Index
Base:
SEQ-RD(MB/s) SEQ-WR(MB/s) RND-RD(IOPS) RND-WR(IOPS) Insert(TPS) Update(TPS) Delete(TPS)
1 867.82 204.15 41440.03 41370.54 680.8 1025.94 1031.08
2 871.87 205.87 41370.3 40275.2 791.14 1065.84 1101.7
3 866.52 205.69 41795.67 40596.16 694.69 1037.16 1031.48
Avg 868.7366667 205.2366667 41535.33333 40747.3 722.21 1042.98 1054.753333
After:
SEQ-RD(MB/s) SEQ-WR(MB/s) RND-RD(IOPS) RND-WR(IOPS) Insert(TPS) Update(TPS) Delete(TPS)
1 798.81 202.5 41143 40613.87 602.71 838.08 913.83
2 805.79 206.47 40297.2 41291.46 604.44 840.75 924.27
3 814.83 206.17 41209.57 40453.62 602.85 834.66 927.91
Avg 806.4766667 205.0466667 40883.25667 40786.31667 603.3333333 837.83 922.0033333
Patched/Original:
0.928332713 0.999074239 0.984300676 1.000957528 0.835398753 0.803303994 0.874141189
It looks like atomic write will suffer performance regression.
I suspect that the criminal is that we forcing to wait all dnode being in
storage cache before we issue PREFLUSH+FUA.
BTW, will commit ("f2fs: don't need to wait for node writes for atomic write")
cause the problem: we will lose data of last transaction after SPO, even if
atomic write return no error:
- atomic_open();
- write() P1, P2, P3;
- atomic_commit();
- writeback data: P1, P2, P3;
- writeback node: N1, N2, N3; <--- If N1, N2 is not writebacked, N3 with fsync_mark is
writebacked, In SPOR, we won't find N3 since node chain is broken, turns out that losing
last transaction.
- preflush + fua;
- power-cut
If we don't wait dnode writeback for atomic_write:
SEQ-RD(MB/s) SEQ-WR(MB/s) RND-RD(IOPS) RND-WR(IOPS) Insert(TPS) Update(TPS) Delete(TPS)
1 779.91 206.03 41621.5 40333.16 716.9 1038.21 1034.85
2 848.51 204.35 40082.44 39486.17 791.83 1119.96 1083.77
3 772.12 206.27 41335.25 41599.65 723.29 1055.07 971.92
Avg 800.18 205.55 41013.06333 40472.99333 744.0066667 1071.08 1030.18
Patched/Original:
0.92108464 1.001526693 0.987425886 0.993268102 1.030180511 1.026942031 0.976702294
SQLite's performance recovers.
Jaegeuk:
"Practically, I don't see db corruption becase of this. We can excuse to lose
the last transaction."
Finally, we decided to keep the original semantics of the atomic write interface:
we don't wait for all dnode writeback before preflush+fua submission.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-02 23:03:19 +08:00
	f2fs_bug_on(sbi, sbi->fsync_node_num);
2021-05-20 19:51:50 +08:00
	f2fs_destroy_compress_inode(sbi);
2012-11-02 17:07:47 +09:00
	iput(sbi->node_inode);
2019-01-01 00:11:30 -08:00
	sbi->node_inode = NULL;
2012-11-02 17:07:47 +09:00
	iput(sbi->meta_inode);
2019-01-01 00:11:30 -08:00
	sbi->meta_inode = NULL;
2012-11-02 17:07:47 +09:00
2018-12-26 11:20:29 +05:30
	/*
	 * iput() can update stat information, if f2fs_write_checkpoint()
	 * above failed with error.
	 */
	f2fs_destroy_stats(sbi);
2012-11-02 17:07:47 +09:00
	/* destroy f2fs internal modules */
2018-05-30 00:20:41 +08:00
	f2fs_destroy_node_manager(sbi);
	f2fs_destroy_segment_manager(sbi);
2012-11-02 17:07:47 +09:00
f2fs: support data compression
This patch tries to support compression in f2fs.
- New term named cluster is defined as basic unit of compression, file can
be divided into multiple clusters logically. One cluster includes 4 << n
(n >= 0) logical pages, compression size is also cluster size, each of
cluster can be compressed or not.
- In cluster metadata layout, one special flag is used to indicate cluster
is compressed one or normal one, for compressed cluster, following metadata
maps cluster to [1, 4 << n - 1] physical blocks, in where f2fs stores
data including compress header and compressed data.
- In order to eliminate write amplification during overwrite, F2FS only
support compression on write-once file, data can be compressed only when
all logical blocks in file are valid and cluster compress ratio is lower
than specified threshold.
- To enable compression on regular inode, there are three ways:
* chattr +c file
* chattr +c dir; touch dir/file
* mount w/ -o compress_extension=ext; touch file.ext
Compress metadata layout:
                             [Dnode Structure]
             +-----------------------------------------------+
             | cluster 1 | cluster 2 | ......... | cluster N |
             +-----------------------------------------------+
             .           .                       .           .
       .                       .                .                      .
  .         Compressed Cluster       .        .        Normal Cluster            .
+----------+---------+---------+---------+  +---------+---------+---------+---------+
|compr flag| block 1 | block 2 | block 3 |  | block 1 | block 2 | block 3 | block 4 |
+----------+---------+---------+---------+  +---------+---------+---------+---------+
           .                             .
         .                                           .
       .                                                           .
      +-------------+-------------+----------+----------------------------+
      | data length | data chksum | reserved |      compressed data       |
      +-------------+-------------+----------+----------------------------+
Changelog:
20190326:
- fix error handling of read_end_io().
- remove unneeded comments in f2fs_encrypt_one_page().
20190327:
- fix wrong use of f2fs_cluster_is_full() in f2fs_mpage_readpages().
- don't jump into loop directly to avoid uninitialized variables.
- add TODO tag in error path of f2fs_write_cache_pages().
20190328:
- fix wrong merge condition in f2fs_read_multi_pages().
- check compressed file in f2fs_post_read_required().
20190401
- allow overwrite on non-compressed cluster.
- check cluster meta before writing compressed data.
20190402
- don't preallocate blocks for compressed file.
- add lz4 compress algorithm
- process multiple post read works in one workqueue
Now f2fs supports processing post read work in multiple workqueue,
it shows low performance due to schedule overhead of multiple
workqueue executing orderly.
20190921
- compress: support buffered overwrite
C: compress cluster flag
V: valid block address
N: NEW_ADDR
One cluster contain 4 blocks
before overwrite after overwrite
- VVVV -> CVNN
- CVNN -> VVVV
- CVNN -> CVNN
- CVNN -> CVVV
- CVVV -> CVNN
- CVVV -> CVVV
20191029
- add kconfig F2FS_FS_COMPRESSION to isolate compression related
codes, add kconfig F2FS_FS_{LZO,LZ4} to cover backend algorithm.
note that: will remove lzo backend if Jaegeuk agreed that too.
- update codes according to Eric's comments.
20191101
- apply fixes from Jaegeuk
20191113
- apply fixes from Jaegeuk
- split workqueue for fsverity
20191216
- apply fixes from Jaegeuk
20200117
- fix to avoid NULL pointer dereference
[Jaegeuk Kim]
- add tracepoint for f2fs_{,de}compress_pages()
- fix many bugs and add some compression stats
- fix overwrite/mmap bugs
- address 32bit build error, reported by Geert.
- bug fixes when handling errors and i_compressed_blocks
Reported-by: <noreply@ellerman.id.au>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
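The cluster arithmetic described in the message above ("4 << n" logical pages per cluster) can be sketched as follows. The helper names `cluster_pages()`, `cluster_index()` and `max_compressed_blocks()` are hypothetical and are not f2fs functions. The upper bound assumes that "[1, 4 << n - 1]" means one block fewer than the cluster size, which is a reading of the message rather than something it states explicitly.

```c
#include <assert.h>

/* Pages per cluster for a given n (n >= 0), per "4 << n". */
static unsigned long cluster_pages(unsigned int n)
{
	return 4UL << n;
}

/* Index of the cluster that holds logical page page_idx. */
static unsigned long cluster_index(unsigned long page_idx, unsigned int n)
{
	return page_idx / cluster_pages(n);
}

/*
 * Assumed upper bound on the physical blocks a compressed cluster maps to:
 * one fewer than the cluster size.
 */
static unsigned long max_compressed_blocks(unsigned int n)
{
	return cluster_pages(n) - 1;
}
```

For n = 0 this gives the four-block cluster shown in the layout diagram, where a compressed cluster holds the flag plus at most three data blocks.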
2019-11-01 18:07:14 +08:00
	f2fs_destroy_post_read_wq(sbi);
2018-12-13 18:38:33 -08:00
	kvfree(sbi->ckpt);
2017-06-14 17:39:46 +08:00
2012-11-02 17:07:47 +09:00
	sb->s_fs_info = NULL;
2016-03-02 12:04:24 -08:00
	if (sbi->s_chksum_driver)
		crypto_free_shash(sbi->s_chksum_driver);
2020-06-10 01:14:46 +03:00
	kfree(sbi->raw_super);
2016-05-13 12:36:58 -07:00
2016-10-06 19:02:05 -07:00
	destroy_device_list(sbi);
2020-09-14 17:05:13 +08:00
	f2fs_destroy_page_array_cache(sbi);
2020-02-25 18:17:10 +08:00
	f2fs_destroy_xattr_caches(sbi);
2017-02-27 18:43:12 +08:00
	mempool_destroy(sbi->write_io_dummy);
2017-08-08 10:54:31 +08:00
#ifdef CONFIG_QUOTA
	for (i = 0; i < MAXQUOTAS; i++)
2020-06-17 20:30:12 +08:00
		kfree(F2FS_OPTION(sbi).s_qf_names[i]);
2017-08-08 10:54:31 +08:00
#endif
fscrypt: handle test_dummy_encryption in more logical way
The behavior of the test_dummy_encryption mount option is that when a
new file (or directory or symlink) is created in an unencrypted
directory, it's automatically encrypted using a dummy encryption policy.
That's it; in particular, the encryption (or lack thereof) of existing
files (or directories or symlinks) doesn't change.
Unfortunately the implementation of test_dummy_encryption is a bit weird
and confusing. When test_dummy_encryption is enabled and a file is
being created in an unencrypted directory, we set up an encryption key
(->i_crypt_info) for the directory. This isn't actually used to do any
encryption, however, since the directory is still unencrypted! Instead,
->i_crypt_info is only used for inheriting the encryption policy.
One consequence of this is that the filesystem ends up providing a
"dummy context" (policy + nonce) instead of a "dummy policy". In
commit ed318a6cc0b6 ("fscrypt: support test_dummy_encryption=v2"), I
mistakenly thought this was required. However, actually the nonce only
ends up being used to derive a key that is never used.
Another consequence of this implementation is that it allows for
'inode->i_crypt_info != NULL && !IS_ENCRYPTED(inode)', which is an edge
case that can be forgotten about. For example, currently
FS_IOC_GET_ENCRYPTION_POLICY on an unencrypted directory may return the
dummy encryption policy when the filesystem is mounted with
test_dummy_encryption. That seems like the wrong thing to do, since
again, the directory itself is not actually encrypted.
Therefore, switch to a more logical and maintainable implementation
where the dummy encryption policy inheritance is done without setting up
keys for unencrypted directories. This involves:
- Adding a function fscrypt_policy_to_inherit() which returns the
encryption policy to inherit from a directory. This can be a real
policy, a dummy policy, or no policy.
- Replacing struct fscrypt_dummy_context, ->get_dummy_context(), etc.
with struct fscrypt_dummy_policy, ->get_dummy_policy(), etc.
- Making fscrypt_fname_encrypted_size() take an fscrypt_policy instead
of an inode.
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-13-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-16 21:11:35 -07:00
	fscrypt_free_dummy_policy(&F2FS_OPTION(sbi).dummy_enc_policy);
2016-05-13 12:36:58 -07:00
	destroy_percpu_info(sbi);
f2fs: introduce periodic iostat io latency traces
Whenever we notice some sluggish issues on our machines, we are always
curious about how well all types of I/O in the f2fs filesystem are
handled. But, it's hard to get this kind of real data. First of all,
we need to reproduce the issue while turning on the profiling tool like
blktrace, but the issue doesn't happen again easily. Second, with the
intervention of any tools, the overall timing of the issue will be
slightly changed and it sometimes makes us hard to figure it out.
So, I added the feature printing out IO latency statistics tracepoint
events, which are minimal things to understand filesystem's I/O related
behaviors, into F2FS_IOSTAT kernel config. With "iostat_enable" sysfs
node on, we can get this statistics info in a periodic way and it
would cause the least overhead.
[samples]
f2fs_ckpt-254:1-507 [003] .... 2842.439683: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [136/1/801], rd_node [136/1/1704], rd_meta [4/2/4],
wr_sync_data [164/16/3331], wr_sync_node [152/3/648],
wr_sync_meta [160/2/4243], wr_async_data [24/13/15],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
f2fs_ckpt-254:1-507 [002] .... 2845.450514: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [60/3/456], rd_node [60/3/1258], rd_meta [0/0/1],
wr_sync_data [120/12/2285], wr_sync_node [88/5/428],
wr_sync_meta [52/6/2990], wr_async_data [4/1/3],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-20 15:29:09 -07:00
	f2fs_destroy_iostat(sbi);
2017-05-10 11:18:25 -07:00
	for (i = 0; i < NR_PAGE_TYPE; i++)
2018-12-13 18:38:33 -08:00
		kvfree(sbi->write_io[i]);
2022-01-18 07:56:14 +01:00
#if IS_ENABLED(CONFIG_UNICODE)
2020-07-08 02:12:36 -07:00
	utf8_unload(sb->s_encoding);
2019-07-23 16:05:28 -07:00
#endif
2020-06-10 01:14:46 +03:00
	kfree(sbi);
2012-11-02 17:07:47 +09:00
}
int f2fs_sync_fs(struct super_block *sb, int sync)
{
	struct f2fs_sb_info *sbi = F2FS_SB(sb);
2015-12-23 17:50:30 +08:00
	int err = 0;
2012-11-02 17:07:47 +09:00
2017-10-23 23:48:49 +02:00
	if (unlikely(f2fs_cp_error(sbi)))
		return 0;
2018-08-20 19:21:43 -07:00
	if (unlikely(is_sbi_flag_set(sbi, SBI_CP_DISABLED)))
		return 0;
2017-10-23 23:48:49 +02:00
2013-04-20 01:28:40 +09:00
	trace_f2fs_sync_fs(sb, sync);
2017-08-08 10:54:31 +08:00
	if (unlikely(is_sbi_flag_set(sbi, SBI_POR_DOING)))
		return -EAGAIN;
2021-01-19 09:00:42 +09:00
	if (sync)
		err = f2fs_issue_checkpoint(sbi);
2015-01-29 11:45:33 -08:00
2015-12-23 17:50:30 +08:00
	return err;
2012-11-02 17:07:47 +09:00
}
2013-01-29 18:30:07 +09:00
static int f2fs_freeze(struct super_block *sb)
{
2013-05-20 20:28:47 +09:00
	if (f2fs_readonly(sb))
2013-01-29 18:30:07 +09:00
		return 0;
2016-11-04 14:59:15 -07:00
	/* IO error happened before */
	if (unlikely(f2fs_cp_error(F2FS_SB(sb))))
		return -EIO;
	/* must be clean, since sync_filesystem() was already called */
	if (is_sbi_flag_set(F2FS_SB(sb), SBI_IS_DIRTY))
		return -EINVAL;
2021-02-08 13:42:21 -08:00
2022-08-19 15:52:02 -07:00
	/* Let's flush checkpoints and stop the thread. */
	f2fs_flush_ckpt_thread(F2FS_SB(sb));
2022-03-04 09:40:05 -08:00
	/* to avoid deadlock on f2fs_evict_inode->SB_FREEZE_FS */
	set_sbi_flag(F2FS_SB(sb), SBI_IS_FREEZING);
2016-11-04 14:59:15 -07:00
	return 0;
2013-01-29 18:30:07 +09:00
}
static int f2fs_unfreeze(struct super_block *sb)
{
2022-03-04 09:40:05 -08:00
	clear_sbi_flag(F2FS_SB(sb), SBI_IS_FREEZING);
2013-01-29 18:30:07 +09:00
	return 0;
}
2017-07-29 00:32:53 +08:00
# ifdef CONFIG_QUOTA
static int f2fs_statfs_project ( struct super_block * sb ,
kprojid_t projid , struct kstatfs * buf )
{
struct kqid qid ;
struct dquot * dquot ;
u64 limit ;
u64 curblock ;
qid = make_kqid_projid ( projid ) ;
dquot = dqget ( sb , qid ) ;
if ( IS_ERR ( dquot ) )
return PTR_ERR ( dquot ) ;
2018-07-24 20:17:53 +08:00
spin_lock ( & dquot - > dq_dqb_lock ) ;
2017-07-29 00:32:53 +08:00
2020-01-04 22:20:04 +08:00
limit = min_not_zero ( dquot - > dq_dqb . dqb_bsoftlimit ,
dquot - > dq_dqb . dqb_bhardlimit ) ;
f2fs: fix miscounted block limit in f2fs_statfs_project()
statfs calculates Total/Used/Avail disk space in block unit,
so we should translate soft/hard prjquota limit to block unit
as well.
Below testing result shows the block/inode numbers of
Total/Used/Avail from df command are all correct after
applying this patch.
[root@localhost quota-tools]# ./repquota -P /dev/sdb1
*** Report for project quotas on device /dev/sdb1
Block grace time: 7days; Inode grace time: 7days
                    Block limits                  File limits
Project       used    soft    hard  grace    used  soft  hard  grace
-----------------------------------------------------------
#0      --       4       0       0              1     0     0
#101    --       0       0       0              2     0     0
#102    --       0   10240       0              2    10     0
#103    --       0       0   20480              2     0    20
#104    --       0   10240   20480              2    10    20
#105    --       0   20480   10240              2    20    10
[root@localhost sdb1]# lsattr -p t{1,2,3,4,5}
101 ----------------N-- t1/a1
102 ----------------N-- t2/a2
103 ----------------N-- t3/a3
104 ----------------N-- t4/a4
105 ----------------N-- t5/a5
[root@localhost sdb1]# df -hi t{1,2,3,4,5}
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/sdb1        2.4M    21  2.4M    1% /mnt/sdb1
/dev/sdb1          10     2     8   20% /mnt/sdb1
/dev/sdb1          20     2    18   10% /mnt/sdb1
/dev/sdb1          10     2     8   20% /mnt/sdb1
/dev/sdb1          10     2     8   20% /mnt/sdb1
[root@localhost sdb1]# df -h t{1,2,3,4,5}
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1        10G  489M  9.6G   5% /mnt/sdb1
/dev/sdb1        10M     0   10M   0% /mnt/sdb1
/dev/sdb1        20M     0   20M   0% /mnt/sdb1
/dev/sdb1        10M     0   10M   0% /mnt/sdb1
/dev/sdb1        10M     0   10M   0% /mnt/sdb1
Fixes: 909110c060f2 ("f2fs: choose hardlimit when softlimit is larger than hardlimit in f2fs_statfs_project()")
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
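The unit conversion described in this commit message can be illustrated with a small userspace sketch. It is not the kernel code: min_not_zero_u64() is a hypothetical stand-in for the kernel's min_not_zero() macro, effective_block_limit() is a name invented here, and the sketch assumes the soft/hard limits are held in bytes, as the shift by s_blocksize_bits in the surrounding function suggests.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the kernel's min_not_zero(): returns the
 * smaller of the two values while ignoring a zero operand; the result
 * is zero only when both operands are zero.
 */
static inline uint64_t min_not_zero_u64(uint64_t a, uint64_t b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/*
 * Effective project block limit, expressed in filesystem blocks.
 * soft_bytes and hard_bytes are assumed to be byte counts;
 * blocksize_bits is log2 of the block size (12 for 4 KiB blocks).
 */
static inline uint64_t effective_block_limit(uint64_t soft_bytes,
					     uint64_t hard_bytes,
					     unsigned int blocksize_bits)
{
	assert(blocksize_bits < 64);	/* guard against an undefined shift */
	return min_not_zero_u64(soft_bytes, hard_bytes) >> blocksize_bits;
}
```

With 4 KiB blocks, a 10 MiB soft limit and a 20 MiB hard limit give 2560 blocks, which is consistent with the 10M size that df reports for project 104 in the output above.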
2020-01-04 22:20:03 +08:00
if ( limit )
limit > > = sb - > s_blocksize_bits ;
2019-11-25 11:20:36 +08:00
2017-07-29 00:32:53 +08:00
if ( limit & & buf - > f_blocks > limit ) {
2020-05-11 09:15:18 +03:00
curblock = ( dquot - > dq_dqb . dqb_curspace +
dquot - > dq_dqb . dqb_rsvspace ) > > sb - > s_blocksize_bits ;
2017-07-29 00:32:53 +08:00
buf - > f_blocks = limit ;
buf - > f_bfree = buf - > f_bavail =
( buf - > f_blocks > curblock ) ?
( buf - > f_blocks - curblock ) : 0 ;
}
2020-01-04 22:20:04 +08:00
limit = min_not_zero ( dquot - > dq_dqb . dqb_isoftlimit ,
dquot - > dq_dqb . dqb_ihardlimit ) ;
2019-11-25 11:20:36 +08:00
2017-07-29 00:32:53 +08:00
if ( limit & & buf - > f_files > limit ) {
buf - > f_files = limit ;
buf - > f_ffree =
( buf - > f_files > dquot - > dq_dqb . dqb_curinodes ) ?
( buf - > f_files - dquot - > dq_dqb . dqb_curinodes ) : 0 ;
}
2018-07-24 20:17:53 +08:00
spin_unlock ( & dquot - > dq_dqb_lock ) ;
2017-07-29 00:32:53 +08:00
dqput ( dquot ) ;
return 0 ;
}
# endif
2012-11-02 17:07:47 +09:00
static int f2fs_statfs ( struct dentry * dentry , struct kstatfs * buf )
{
struct super_block * sb = dentry - > d_sb ;
struct f2fs_sb_info * sbi = F2FS_SB ( sb ) ;
u64 id = huge_encode_dev ( sb - > s_bdev - > bd_dev ) ;
2018-01-03 10:55:07 -08:00
block_t total_count , user_block_count , start_count ;
2017-06-21 20:55:55 -07:00
u64 avail_node_count ;
2022-04-22 20:05:04 +02:00
unsigned int total_valid_node_count ;
2012-11-02 17:07:47 +09:00
total_count = le64_to_cpu ( sbi - > raw_super - > block_count ) ;
start_count = le32_to_cpu ( sbi - > raw_super - > segment0_blkaddr ) ;
buf - > f_type = F2FS_SUPER_MAGIC ;
buf - > f_bsize = sbi - > blocksize ;
buf - > f_blocks = total_count - start_count ;
2022-04-22 20:05:04 +02:00
spin_lock ( & sbi - > stat_lock ) ;
user_block_count = sbi - > user_block_count ;
total_valid_node_count = valid_node_count ( sbi ) ;
avail_node_count = sbi - > total_node_count - F2FS_RESERVED_NODE_NUM ;
2018-01-03 10:55:07 -08:00
buf - > f_bfree = user_block_count - valid_user_blocks ( sbi ) -
2017-10-27 20:45:05 +08:00
sbi - > current_reserved_blocks ;
2019-05-05 11:40:46 +08:00
2018-08-20 19:21:43 -07:00
if ( unlikely ( buf - > f_bfree < = sbi - > unusable_block_count ) )
buf - > f_bfree = 0 ;
else
buf - > f_bfree - = sbi - > unusable_block_count ;
2019-05-05 11:40:46 +08:00
spin_unlock ( & sbi - > stat_lock ) ;
2018-08-20 19:21:43 -07:00
2018-03-08 14:22:56 +08:00
if ( buf - > f_bfree > F2FS_OPTION ( sbi ) . root_reserved_blocks )
buf - > f_bavail = buf - > f_bfree -
F2FS_OPTION ( sbi ) . root_reserved_blocks ;
2017-12-27 15:05:52 -08:00
else
buf - > f_bavail = 0 ;
2012-11-02 17:07:47 +09:00
2017-06-21 20:55:55 -07:00
if ( avail_node_count > user_block_count ) {
buf - > f_files = user_block_count ;
buf - > f_ffree = buf - > f_bavail ;
} else {
buf - > f_files = avail_node_count ;
2022-04-22 20:05:04 +02:00
buf - > f_ffree = min ( avail_node_count - total_valid_node_count ,
2017-06-21 20:55:55 -07:00
buf - > f_bavail ) ;
}
2012-11-02 17:07:47 +09:00
2013-03-03 13:58:05 +09:00
buf - > f_namelen = F2FS_NAME_LEN ;
2020-09-18 16:45:50 -04:00
buf - > f_fsid = u64_to_fsid ( id ) ;
2012-11-02 17:07:47 +09:00
2017-07-29 00:32:53 +08:00
# ifdef CONFIG_QUOTA
if ( is_inode_flag_set ( dentry - > d_inode , FI_PROJ_INHERIT ) & &
sb_has_quota_limits_enabled ( sb , PRJQUOTA ) ) {
f2fs_statfs_project ( sb , F2FS_I ( dentry - > d_inode ) - > i_projid , buf ) ;
}
# endif
2012-11-02 17:07:47 +09:00
return 0 ;
}
2017-08-08 10:54:31 +08:00
static inline void f2fs_show_quota_options ( struct seq_file * seq ,
struct super_block * sb )
{
# ifdef CONFIG_QUOTA
struct f2fs_sb_info * sbi = F2FS_SB ( sb ) ;
2018-03-08 14:22:56 +08:00
if ( F2FS_OPTION ( sbi ) . s_jquota_fmt ) {
2017-08-08 10:54:31 +08:00
char * fmtname = " " ;
2018-03-08 14:22:56 +08:00
switch ( F2FS_OPTION ( sbi ) . s_jquota_fmt ) {
2017-08-08 10:54:31 +08:00
case QFMT_VFS_OLD :
fmtname = " vfsold " ;
break ;
case QFMT_VFS_V0 :
fmtname = " vfsv0 " ;
break ;
case QFMT_VFS_V1 :
fmtname = " vfsv1 " ;
break ;
}
seq_printf ( seq , " ,jqfmt=%s " , fmtname ) ;
}
2018-03-08 14:22:56 +08:00
if ( F2FS_OPTION ( sbi ) . s_qf_names [ USRQUOTA ] )
seq_show_option ( seq , " usrjquota " ,
F2FS_OPTION ( sbi ) . s_qf_names [ USRQUOTA ] ) ;
2017-08-08 10:54:31 +08:00
2018-03-08 14:22:56 +08:00
if ( F2FS_OPTION ( sbi ) . s_qf_names [ GRPQUOTA ] )
seq_show_option ( seq , " grpjquota " ,
F2FS_OPTION ( sbi ) . s_qf_names [ GRPQUOTA ] ) ;
2017-08-08 10:54:31 +08:00
2018-03-08 14:22:56 +08:00
if ( F2FS_OPTION ( sbi ) . s_qf_names [ PRJQUOTA ] )
seq_show_option ( seq , " prjjquota " ,
F2FS_OPTION ( sbi ) . s_qf_names [ PRJQUOTA ] ) ;
2017-08-08 10:54:31 +08:00
# endif
}
2021-02-20 17:38:41 +08:00
# ifdef CONFIG_F2FS_FS_COMPRESSION
f2fs: support data compression
This patch tries to support compression in f2fs.
- A new term, cluster, is defined as the basic unit of compression; a file can
be divided logically into multiple clusters. One cluster includes 4 << n
(n >= 0) logical pages, the compression size is also the cluster size, and
each cluster can be compressed or not.
- In the cluster metadata layout, one special flag indicates whether a cluster
is a compressed one or a normal one. For a compressed cluster, the following
metadata maps the cluster to [1, (4 << n) - 1] physical blocks, where f2fs
stores data including the compress header and the compressed data.
- In order to eliminate write amplification during overwrite, F2FS only
supports compression on write-once files; data can be compressed only when
all logical blocks in file are valid and the cluster compress ratio is lower
than the specified threshold.
- To enable compression on regular inode, there are three ways:
* chattr +c file
* chattr +c dir; touch dir/file
* mount w/ -o compress_extension=ext; touch file.ext
Compress metadata layout:
                             [Dnode Structure]
             +-----------------------------------------------+
             | cluster 1 | cluster 2 | ......... | cluster N |
             +-----------------------------------------------+
             .           .                       .           .
       .                       .                .                      .
  .         Compressed Cluster       .        .        Normal Cluster            .
+----------+---------+---------+---------+  +---------+---------+---------+---------+
|compr flag| block 1 | block 2 | block 3 |  | block 1 | block 2 | block 3 | block 4 |
+----------+---------+---------+---------+  +---------+---------+---------+---------+
           .                             .
         .                                           .
       .                                                           .
      +-------------+-------------+----------+----------------------------+
      | data length | data chksum | reserved |      compressed data       |
      +-------------+-------------+----------+----------------------------+
Changelog:
20190326:
- fix error handling of read_end_io().
- remove unneeded comments in f2fs_encrypt_one_page().
20190327:
- fix wrong use of f2fs_cluster_is_full() in f2fs_mpage_readpages().
- don't jump into loop directly to avoid uninitialized variables.
- add TODO tag in error path of f2fs_write_cache_pages().
20190328:
- fix wrong merge condition in f2fs_read_multi_pages().
- check compressed file in f2fs_post_read_required().
20190401
- allow overwrite on non-compressed cluster.
- check cluster meta before writing compressed data.
20190402
- don't preallocate blocks for compressed file.
- add lz4 compress algorithm
- process multiple post read works in one workqueue
Now f2fs supports processing post read work in multiple workqueue,
it shows low performance due to schedule overhead of multiple
workqueue executing orderly.
20190921
- compress: support buffered overwrite
C: compress cluster flag
V: valid block address
N: NEW_ADDR
One cluster contains 4 blocks
  before overwrite      after overwrite
  - VVVV            ->  CVNN
  - CVNN            ->  VVVV
  - CVNN            ->  CVNN
  - CVNN            ->  CVVV
  - CVVV            ->  CVNN
  - CVVV            ->  CVVV
20191029
- add kconfig F2FS_FS_COMPRESSION to isolate compression related
codes, add kconfig F2FS_FS_{LZO,LZ4} to cover backend algorithm.
note that: will remove lzo backend if Jaegeuk agreed that too.
- update codes according to Eric's comments.
20191101
- apply fixes from Jaegeuk
20191113
- apply fixes from Jaegeuk
- split workqueue for fsverity
20191216
- apply fixes from Jaegeuk
20200117
- fix to avoid NULL pointer dereference
[Jaegeuk Kim]
- add tracepoint for f2fs_{,de}compress_pages()
- fix many bugs and add some compression stats
- fix overwrite/mmap bugs
- address 32bit build error, reported by Geert.
- bug fixes when handling errors and i_compressed_blocks
Reported-by: <noreply@ellerman.id.au>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
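The cluster arithmetic in this commit message can be written out as a minimal sketch. The helper names cluster_pages() and max_compressed_blocks() are invented for illustration and are not f2fs functions. The upper bound of the physical block range is read here as (4 << n) - 1; in C the parentheses matter, because an unparenthesized 4 << n - 1 would parse as 4 << (n - 1).

```c
#include <assert.h>

/* Logical pages in one cluster for a given n (n >= 0): 4 << n. */
static inline unsigned int cluster_pages(unsigned int n)
{
	assert(n < 16);	/* arbitrary guard for this sketch, not an f2fs limit */
	return 4u << n;
}

/*
 * Largest number of physical blocks a compressed cluster may map to:
 * one less than the cluster size, i.e. (4 << n) - 1.
 */
static inline unsigned int max_compressed_blocks(unsigned int n)
{
	return cluster_pages(n) - 1;
}
```

For n = 0 this gives a 4-page cluster that maps to at most 3 physical blocks when compressed, matching the "compr flag + block 1..3" row in the layout diagram.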
2019-11-01 18:07:14 +08:00
static inline void f2fs_show_compress_options ( struct seq_file * seq ,
struct super_block * sb )
{
struct f2fs_sb_info * sbi = F2FS_SB ( sb ) ;
char * algtype = " " ;
int i ;
if ( ! f2fs_sb_has_compression ( sbi ) )
return ;
switch ( F2FS_OPTION ( sbi ) . compress_algorithm ) {
case COMPRESS_LZO :
algtype = " lzo " ;
break ;
case COMPRESS_LZ4 :
algtype = " lz4 " ;
break ;
2020-03-03 17:46:02 +08:00
case COMPRESS_ZSTD :
algtype = " zstd " ;
break ;
2020-04-08 19:56:32 +08:00
case COMPRESS_LZORLE :
algtype = " lzo-rle " ;
break ;
2019-11-01 18:07:14 +08:00
}
seq_printf ( seq , " ,compress_algorithm=%s " , algtype ) ;
2021-01-22 17:46:43 +08:00
if ( F2FS_OPTION ( sbi ) . compress_level )
seq_printf ( seq , " :%d " , F2FS_OPTION ( sbi ) . compress_level ) ;
2019-11-01 18:07:14 +08:00
seq_printf ( seq , " ,compress_log_size=%u " ,
F2FS_OPTION ( sbi ) . compress_log_size ) ;
for ( i = 0 ; i < F2FS_OPTION ( sbi ) . compress_ext_cnt ; i + + ) {
seq_printf ( seq , " ,compress_extension=%s " ,
F2FS_OPTION ( sbi ) . extensions [ i ] ) ;
}
2020-11-26 18:32:09 +08:00
2021-06-08 19:15:08 +08:00
for ( i = 0 ; i < F2FS_OPTION ( sbi ) . nocompress_ext_cnt ; i + + ) {
seq_printf ( seq , " ,nocompress_extension=%s " ,
F2FS_OPTION ( sbi ) . noextensions [ i ] ) ;
}
2020-11-26 18:32:09 +08:00
if ( F2FS_OPTION ( sbi ) . compress_chksum )
seq_puts ( seq , " ,compress_chksum " ) ;
2020-12-01 13:08:02 +09:00
if ( F2FS_OPTION ( sbi ) . compress_mode = = COMPR_MODE_FS )
seq_printf ( seq , " ,compress_mode=%s " , " fs " ) ;
else if ( F2FS_OPTION ( sbi ) . compress_mode = = COMPR_MODE_USER )
seq_printf ( seq , " ,compress_mode=%s " , " user " ) ;
2021-05-20 19:51:50 +08:00
if ( test_opt ( sbi , COMPRESS_CACHE ) )
seq_puts ( seq , " ,compress_cache " ) ;
2019-11-01 18:07:14 +08:00
}
2021-02-20 17:38:41 +08:00
# endif
2019-11-01 18:07:14 +08:00
2012-11-02 17:07:47 +09:00
static int f2fs_show_options ( struct seq_file * seq , struct dentry * root )
{
struct f2fs_sb_info * sbi = F2FS_SB ( root - > d_sb ) ;
2020-02-14 17:44:13 +08:00
if ( F2FS_OPTION ( sbi ) . bggc_mode = = BGGC_MODE_SYNC )
seq_printf ( seq , " ,background_gc=%s " , " sync " ) ;
else if ( F2FS_OPTION ( sbi ) . bggc_mode = = BGGC_MODE_ON )
seq_printf ( seq , " ,background_gc=%s " , " on " ) ;
else if ( F2FS_OPTION ( sbi ) . bggc_mode = = BGGC_MODE_OFF )
2013-06-16 09:48:48 +09:00
seq_printf ( seq , " ,background_gc=%s " , " off " ) ;
2020-02-14 17:44:13 +08:00
2021-03-27 17:57:06 +08:00
if ( test_opt ( sbi , GC_MERGE ) )
seq_puts ( seq , " ,gc_merge " ) ;
2023-02-02 17:41:23 +08:00
else
seq_puts ( seq , " ,nogc_merge " ) ;
2021-03-27 17:57:06 +08:00
2012-11-02 17:07:47 +09:00
if ( test_opt ( sbi , DISABLE_ROLL_FORWARD ) )
seq_puts ( seq , " ,disable_roll_forward " ) ;
2020-02-14 17:45:11 +08:00
if ( test_opt ( sbi , NORECOVERY ) )
seq_puts ( seq , " ,norecovery " ) ;
2023-01-16 22:12:28 +08:00
if ( test_opt ( sbi , DISCARD ) ) {
2012-11-02 17:07:47 +09:00
seq_puts ( seq , " ,discard " ) ;
2023-01-16 22:12:28 +08:00
if ( F2FS_OPTION ( sbi ) . discard_unit = = DISCARD_UNIT_BLOCK )
seq_printf ( seq , " ,discard_unit=%s " , " block " ) ;
else if ( F2FS_OPTION ( sbi ) . discard_unit = = DISCARD_UNIT_SEGMENT )
seq_printf ( seq , " ,discard_unit=%s " , " segment " ) ;
else if ( F2FS_OPTION ( sbi ) . discard_unit = = DISCARD_UNIT_SECTION )
seq_printf ( seq , " ,discard_unit=%s " , " section " ) ;
} else {
2019-05-24 14:38:39 +05:30
seq_puts ( seq , " ,nodiscard " ) ;
2023-01-16 22:12:28 +08:00
}
2012-11-02 17:07:47 +09:00
if ( test_opt ( sbi , NOHEAP ) )
2017-03-24 20:41:45 -04:00
seq_puts ( seq , " ,no_heap " ) ;
else
seq_puts ( seq , " ,heap " ) ;
2012-11-02 17:07:47 +09:00
# ifdef CONFIG_F2FS_FS_XATTR
if ( test_opt ( sbi , XATTR_USER ) )
seq_puts ( seq , " ,user_xattr " ) ;
else
seq_puts ( seq , " ,nouser_xattr " ) ;
2013-08-08 15:16:22 +09:00
if ( test_opt ( sbi , INLINE_XATTR ) )
seq_puts ( seq , " ,inline_xattr " ) ;
2017-02-15 10:34:45 +08:00
else
seq_puts ( seq , " ,noinline_xattr " ) ;
f2fs: support flexible inline xattr size
Now, in products, more and more features based on file encryption are being
introduced, and their demand for xattr space is increasing. However, inline
xattr has a fixed size of 200 bytes; once the inline xattr space is full, newly
added xattr data occupies an additional xattr block, which may bring
more space usage and a performance regression during persisting.
In order to resolve above issue, it's better to expand inline xattr size
flexibly according to user's requirement.
So this patch introduces new filesystem feature 'flexible inline xattr',
and new mount option 'inline_xattr_size=%u', once mkfs enables the
feature, we can use the option to make f2fs supporting flexible inline
xattr size.
To support this feature, we add an extra attribute, i_inline_xattr_size, to the
inode layout, indicating how much space the inline xattr borrows from the
block address mapping space in the inode layout. With it, we can easily
locate and store flexible-sized inline xattr data in the inode.
Inode disk layout:
+----------------------+
| .i_mode              |
| ...                  |
| .i_ext               |
+----------------------+
| .i_extra_isize       |
| .i_inline_xattr_size |-----------+
| ...                  |           |
+----------------------+           |
| .i_addr              |           |
|  - block address or  |           |
|  - inline data       |           |
+----------------------+<---+      v
|    inline xattr      |    +---inline xattr range
+----------------------+<---+
| .i_nid               |
+----------------------+
|   node_footer        |
| (nid, ino, offset)   |
+----------------------+
Note that we have to consider backward compatibility, which always reserved
200 bytes of inline_data space, as reported by Sheng Yong.
Previously, inline data or directory always reserved 200 bytes in the inode layout,
even if inline_xattr was disabled. In order to keep inline_dentry's structure
for backward compatibility, we get the space back only from inline_data.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reported-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
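The space trade-off described in this commit message can be sketched as simple arithmetic. Everything below is illustrative: the helper names are invented, and the sketch assumes that the inline xattr size is counted in 4-byte block-address slots, so that 50 slots correspond to the fixed 200 bytes mentioned above. The total number of address slots in an inode is left as a parameter rather than guessed.

```c
#include <assert.h>

/* Assumption for this sketch: one block-address slot occupies 4 bytes. */
#define ADDR_SLOT_BYTES 4u

/* Bytes of inline xattr space for a given number of borrowed slots. */
static inline unsigned int inline_xattr_bytes(unsigned int xattr_slots)
{
	return xattr_slots * ADDR_SLOT_BYTES;
}

/*
 * Address slots still available for block addresses or inline data
 * after the inline xattr range has borrowed its share.
 */
static inline unsigned int addr_slots_left(unsigned int total_slots,
					   unsigned int xattr_slots)
{
	assert(xattr_slots <= total_slots);	/* cannot borrow more than exists */
	return total_slots - xattr_slots;
}
```

Under that assumption, enlarging the inline xattr range by one slot costs exactly one block-address slot, which is the borrowing relationship shown by the arrow in the layout diagram.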
2017-09-06 21:59:50 +08:00
	if (test_opt(sbi, INLINE_XATTR_SIZE))
		seq_printf(seq, ",inline_xattr_size=%u",
2018-03-08 14:22:56 +08:00
					F2FS_OPTION(sbi).inline_xattr_size);
2012-11-02 17:07:47 +09:00
#endif
#ifdef CONFIG_F2FS_FS_POSIX_ACL
	if (test_opt(sbi, POSIX_ACL))
		seq_puts(seq, ",acl");
	else
		seq_puts(seq, ",noacl");
#endif
	if (test_opt(sbi, DISABLE_EXT_IDENTIFY))
2013-01-25 19:08:59 +01:00
		seq_puts(seq, ",disable_ext_identify");
2013-11-10 23:13:17 +08:00
	if (test_opt(sbi, INLINE_DATA))
		seq_puts(seq, ",inline_data");
2015-03-24 10:20:27 +08:00
	else
		seq_puts(seq, ",noinline_data");
2014-09-24 18:16:13 +08:00
	if (test_opt(sbi, INLINE_DENTRY))
		seq_puts(seq, ",inline_dentry");
2016-05-09 19:56:34 +08:00
	else
		seq_puts(seq, ",noinline_dentry");
2022-11-10 17:15:01 +08:00
	if (test_opt(sbi, FLUSH_MERGE))
2014-04-02 15:34:36 +09:00
		seq_puts(seq, ",flush_merge");
2022-11-10 17:15:01 +08:00
	else
		seq_puts(seq, ",noflush_merge");
2014-07-23 09:57:31 -07:00
	if (test_opt(sbi, NOBARRIER))
		seq_puts(seq, ",nobarrier");
2022-10-25 01:54:01 +08:00
	else
		seq_puts(seq, ",barrier");
2014-10-30 22:47:03 -07:00
	if (test_opt(sbi, FASTBOOT))
		seq_puts(seq, ",fastboot");
2022-11-30 09:36:43 -08:00
	if (test_opt(sbi, READ_EXTENT_CACHE))
2015-02-05 17:55:51 +08:00
		seq_puts(seq, ",extent_cache");
2015-06-25 17:43:04 -07:00
	else
		seq_puts(seq, ",noextent_cache");
2022-12-01 17:37:15 -08:00
	if (test_opt(sbi, AGE_EXTENT_CACHE))
		seq_puts(seq, ",age_extent_cache");
2015-12-16 13:12:16 +08:00
	if (test_opt(sbi, DATA_FLUSH))
		seq_puts(seq, ",data_flush");
2016-06-03 19:29:38 -07:00
	seq_puts(seq, ",mode=");
2020-02-14 17:44:12 +08:00
	if (F2FS_OPTION(sbi).fs_mode == FS_MODE_ADAPTIVE)
2016-06-03 19:29:38 -07:00
		seq_puts(seq, "adaptive");
2020-02-14 17:44:12 +08:00
	else if (F2FS_OPTION(sbi).fs_mode == FS_MODE_LFS)
2016-06-03 19:29:38 -07:00
		seq_puts(seq, "lfs");
2021-09-29 11:12:03 -07:00
	else if (F2FS_OPTION(sbi).fs_mode == FS_MODE_FRAGMENT_SEG)
		seq_puts(seq, "fragment:segment");
	else if (F2FS_OPTION(sbi).fs_mode == FS_MODE_FRAGMENT_BLK)
		seq_puts(seq, "fragment:block");
2018-03-08 14:22:56 +08:00
	seq_printf(seq, ",active_logs=%u", F2FS_OPTION(sbi).active_logs);
2017-12-27 15:05:52 -08:00
	if (test_opt(sbi, RESERVE_ROOT))
2018-01-04 21:36:09 -08:00
		seq_printf(seq, ",reserve_root=%u,resuid=%u,resgid=%u",
2018-03-08 14:22:56 +08:00
				F2FS_OPTION(sbi).root_reserved_blocks,
				from_kuid_munged(&init_user_ns,
					F2FS_OPTION(sbi).s_resuid),
				from_kgid_munged(&init_user_ns,
					F2FS_OPTION(sbi).s_resgid));
2016-12-21 17:09:19 -08:00
	if (F2FS_IO_SIZE_BITS(sbi))
2018-09-22 22:43:09 +08:00
		seq_printf(seq, ",io_bits=%u",
				F2FS_OPTION(sbi).write_io_size_bits);
2017-01-27 09:35:37 +08:00
#ifdef CONFIG_F2FS_FAULT_INJECTION
2018-08-08 17:36:41 +08:00
	if (test_opt(sbi, FAULT_INJECTION)) {
2017-06-12 09:44:24 +08:00
		seq_printf(seq, ",fault_injection=%u",
2018-03-08 14:22:56 +08:00
				F2FS_OPTION(sbi).fault_info.inject_rate);
2018-08-08 17:36:41 +08:00
		seq_printf(seq, ",fault_type=%u",
				F2FS_OPTION(sbi).fault_info.inject_type);
	}
2017-01-27 09:35:37 +08:00
#endif
2017-07-09 00:13:07 +08:00
#ifdef CONFIG_QUOTA
2017-08-08 10:54:31 +08:00
	if (test_opt(sbi, QUOTA))
		seq_puts(seq, ",quota");
2017-07-09 00:13:07 +08:00
	if (test_opt(sbi, USRQUOTA))
		seq_puts(seq, ",usrquota");
	if (test_opt(sbi, GRPQUOTA))
		seq_puts(seq, ",grpquota");
2017-07-26 00:01:41 +08:00
	if (test_opt(sbi, PRJQUOTA))
		seq_puts(seq, ",prjquota");
2017-01-27 09:35:37 +08:00
#endif
2017-08-08 10:54:31 +08:00
	f2fs_show_quota_options(seq, sbi->sb);
fscrypt: support test_dummy_encryption=v2
v1 encryption policies are deprecated in favor of v2, and some new
features (e.g. encryption+casefolding) are only being added for v2.
Therefore, the "test_dummy_encryption" mount option (which is used for
encryption I/O testing with xfstests) needs to support v2 policies.
To do this, extend its syntax to be "test_dummy_encryption=v1" or
"test_dummy_encryption=v2". The existing "test_dummy_encryption" (no
argument) also continues to be accepted, to specify the default setting
-- currently v1, but the next patch changes it to v2.
To cleanly support both v1 and v2 while also making it easy to support
specifying other encryption settings in the future (say, accepting
"$contents_mode:$filenames_mode:v2"), make ext4 and f2fs maintain a
pointer to the dummy fscrypt_context rather than using mount flags.
To avoid concurrency issues, don't allow test_dummy_encryption to be set
or changed during a remount. (The former restriction is new, but
xfstests doesn't run into it, so no one should notice.)
Tested with 'gce-xfstests -c {ext4,f2fs}/encrypt -g auto'. On ext4,
there are two regressions, both of which are test bugs: ext4/023 and
ext4/028 fail because they set an xattr and expect it to be stored
inline, but the increase in size of the fscrypt_context from
24 to 40 bytes causes this xattr to be spilled into an external block.
Link: https://lore.kernel.org/r/20200512233251.118314-4-ebiggers@kernel.org
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-12 16:32:50 -07:00
	fscrypt_show_test_dummy_encryption(seq, ',', sbi->sb);
2012-11-02 17:07:47 +09:00
2020-07-02 01:56:06 +00:00
	if (sbi->sb->s_flags & SB_INLINECRYPT)
		seq_puts(seq, ",inlinecrypt");
2018-03-08 14:22:56 +08:00
	if (F2FS_OPTION(sbi).alloc_mode == ALLOC_MODE_DEFAULT)
2018-02-18 08:50:49 -08:00
		seq_printf(seq, ",alloc_mode=%s", "default");
2018-03-08 14:22:56 +08:00
	else if (F2FS_OPTION(sbi).alloc_mode == ALLOC_MODE_REUSE)
2018-02-18 08:50:49 -08:00
		seq_printf(seq, ",alloc_mode=%s", "reuse");
2018-03-07 12:07:49 +08:00
2018-08-20 19:21:43 -07:00
	if (test_opt(sbi, DISABLE_CHECKPOINT))
2019-05-29 17:49:06 -07:00
		seq_printf(seq, ",checkpoint=disable:%u",
				F2FS_OPTION(sbi).unusable_cap);
f2fs: introduce checkpoint_merge mount option
We've added new mount options, "checkpoint_merge" and "nocheckpoint_merge".
The former creates a kernel daemon that merges concurrent checkpoint
requests as much as possible to eliminate redundant checkpoint issues. In
addition, it avoids the sluggishness caused by a slow checkpoint operation
when the checkpoint is done in the context of a process in a cgroup with a
low i/o budget and cpu shares. To make this work better, we set the
default i/o priority of the kernel daemon to "3", giving it one level higher
priority than other kernel threads. The verification result below
illustrates this.
The basic idea has come from https://opensource.samsung.com.
[Verification]
Android Pixel Device(ARM64, 7GB RAM, 256GB UFS)
Create two I/O cgroups (fg w/ weight 100, bg w/ weight 20)
Set "strict_guarantees" to "1" in BFQ tunables
In "fg" cgroup,
- thread A => trigger 1000 checkpoint operations
"for i in `seq 1 1000`; do touch test_dir1/file; fsync test_dir1;
done"
- thread B => generating async. I/O
"fio --rw=write --numjobs=1 --bs=128k --runtime=3600 --time_based=1
--filename=test_img --name=test"
In "bg" cgroup,
- thread C => trigger repeated checkpoint operations
"echo $$ > /dev/blkio/bg/tasks; while true; do touch test_dir2/file;
fsync test_dir2; done"
We've measured thread A's execution time.
[ w/o patch ]
Elapsed Time: Avg. 68 seconds
[ w/ patch ]
Elapsed Time: Avg. 48 seconds
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
[Jaegeuk Kim: fix the return value in f2fs_start_ckpt_thread, reported by Dan]
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-01-19 09:00:42 +09:00
	if (test_opt(sbi, MERGE_CHECKPOINT))
		seq_puts(seq, ",checkpoint_merge");
	else
		seq_puts(seq, ",nocheckpoint_merge");
2018-03-08 14:22:56 +08:00
	if (F2FS_OPTION(sbi).fsync_mode == FSYNC_MODE_POSIX)
2018-03-07 12:07:49 +08:00
		seq_printf(seq, ",fsync_mode=%s", "posix");
2018-03-08 14:22:56 +08:00
	else if (F2FS_OPTION(sbi).fsync_mode == FSYNC_MODE_STRICT)
2018-03-07 12:07:49 +08:00
		seq_printf(seq, ",fsync_mode=%s", "strict");
2018-07-02 11:37:40 +05:30
	else if (F2FS_OPTION(sbi).fsync_mode == FSYNC_MODE_NOBARRIER)
		seq_printf(seq, ",fsync_mode=%s", "nobarrier");
f2fs: support data compression
This patch tries to support compression in f2fs.
- A new term, cluster, is defined as the basic unit of compression. A file
can be divided into multiple clusters logically. One cluster includes
4 << n (n >= 0) logical pages, the compression size is also the cluster
size, and each cluster can be compressed or not.
- In the cluster metadata layout, one special flag indicates whether a
cluster is a compressed one or a normal one. For a compressed cluster, the
following metadata maps the cluster to [1, (4 << n) - 1] physical blocks,
in which f2fs stores data including the compress header and compressed data.
- In order to eliminate write amplification during overwrite, F2FS only
supports compression on write-once files: data can be compressed only when
all logical blocks in the file are valid and the cluster compress ratio is
lower than the specified threshold.
- To enable compression on regular inode, there are three ways:
* chattr +c file
* chattr +c dir; touch dir/file
* mount w/ -o compress_extension=ext; touch file.ext
Compress metadata layout:
[Dnode Structure]
             +-----------------------------------------------+
             | cluster 1 | cluster 2 | ......... | cluster N |
             +-----------------------------------------------+
             .           .                       .           .
       .                       .                .                      .
  .         Compressed Cluster       .        .        Normal Cluster            .
+----------+---------+---------+---------+  +---------+---------+---------+---------+
|compr flag| block 1 | block 2 | block 3 |  | block 1 | block 2 | block 3 | block 4 |
+----------+---------+---------+---------+  +---------+---------+---------+---------+
           .                             .
         .                                           .
       .                                                          .
      +-------------+-------------+----------+----------------------------+
      | data length | data chksum | reserved |      compressed data       |
      +-------------+-------------+----------+----------------------------+
Changelog:
20190326:
- fix error handling of read_end_io().
- remove unneeded comments in f2fs_encrypt_one_page().
20190327:
- fix wrong use of f2fs_cluster_is_full() in f2fs_mpage_readpages().
- don't jump into loop directly to avoid uninitialized variables.
- add TODO tag in error path of f2fs_write_cache_pages().
20190328:
- fix wrong merge condition in f2fs_read_multi_pages().
- check compressed file in f2fs_post_read_required().
20190401
- allow overwrite on non-compressed cluster.
- check cluster meta before writing compressed data.
20190402
- don't preallocate blocks for compressed file.
- add lz4 compress algorithm
- process multiple post read works in one workqueue
Currently f2fs processes post read work in multiple workqueues, which
shows low performance due to the scheduling overhead of multiple
workqueues executing in order.
20190921
- compress: support buffered overwrite
C: compress cluster flag
V: valid block address
N: NEW_ADDR
One cluster contains 4 blocks
before overwrite after overwrite
- VVVV -> CVNN
- CVNN -> VVVV
- CVNN -> CVNN
- CVNN -> CVVV
- CVVV -> CVNN
- CVVV -> CVVV
20191029
- add kconfig F2FS_FS_COMPRESSION to isolate compression related
code, add kconfig F2FS_FS_{LZO,LZ4} to cover backend algorithms.
Note: the lzo backend will be removed if Jaegeuk agrees to that too.
- update code according to Eric's comments.
20191101
- apply fixes from Jaegeuk
20191113
- apply fixes from Jaegeuk
- split workqueue for fsverity
20191216
- apply fixes from Jaegeuk
20200117
- fix to avoid NULL pointer dereference
[Jaegeuk Kim]
- add tracepoint for f2fs_{,de}compress_pages()
- fix many bugs and add some compression stats
- fix overwrite/mmap bugs
- address 32bit build error, reported by Geert.
- bug fixes when handling errors and i_compressed_blocks
Reported-by: <noreply@ellerman.id.au>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
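The cluster arithmetic in this message can be sketched as follows: a cluster holds 4 << n logical pages, and a compressed cluster occupies between 1 and (4 << n) - 1 physical blocks. This is a minimal user-space sketch, not the f2fs implementation. The 4096-byte block size, the 24-byte header size, and all names are assumptions for illustration, and the compress ratio threshold mentioned above is left out.

```c
#include <stddef.h>

/* Assumed for illustration: block size and compress header size in bytes. */
#define SKETCH_BLOCK_SIZE 4096u
#define SKETCH_HEADER_SIZE 24u

/* Number of logical pages in one cluster, as defined in the message. */
static unsigned int sketch_cluster_pages(unsigned int n)
{
	return 4u << n;
}

/*
 * Return the number of physical blocks a compressed cluster would need
 * (header plus compressed payload, rounded up to whole blocks), or 0 if
 * that would not save at least one block, in which case the cluster
 * would be stored as a normal cluster.
 */
static unsigned int sketch_compressed_blocks(unsigned int n, size_t clen)
{
	size_t total = clen + SKETCH_HEADER_SIZE;
	size_t blocks = (total + SKETCH_BLOCK_SIZE - 1) / SKETCH_BLOCK_SIZE;

	if (blocks > sketch_cluster_pages(n) - 1)
		return 0;
	return (unsigned int)blocks;
}
```

With n = 0, a 5000-byte compressed payload fits in 2 blocks, while a payload that still needs all 4 blocks is rejected.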
2019-11-01 18:07:14 +08:00
2020-07-29 21:21:36 +08:00
#ifdef CONFIG_F2FS_FS_COMPRESSION
2019-11-01 18:07:14 +08:00
	f2fs_show_compress_options(seq, sbi->sb);
2020-07-29 21:21:36 +08:00
#endif
f2fs: support age threshold based garbage collection
There are several issues in the current background GC algorithm:
- The number of valid blocks is one of the key factors in the cost
calculation, so if a segment has few valid blocks, the CB algorithm
will still choose it as a victim even if it is young or is a hot
segment, which is not appropriate.
- GCed data/node blocks go to existing logs, no matter whether the update
frequency of the data already there is the same or not, so hot and cold
data may be mixed again.
- The GC allocator mainly uses LFS type segments, which consumes free
segments more quickly.
This patch introduces a new algorithm named age threshold based
garbage collection to solve the above issues. There are three main
steps:
1. select a source victim:
- set an age threshold, and select candidates based on the threshold:
e.g.
0 means youngest, 100 means oldest; if we set the age threshold to 80,
then dirty segments whose age is in the range [80, 100] are selected
as candidates;
- set a candidate_ratio threshold, and select candidates based on the
ratio, so that we can shrink the candidates to the oldest segments;
- select the target segment with the fewest valid blocks in order to
migrate blocks with minimum cost;
2. select a target victim:
- select candidates based on the age threshold;
- set a candidate_radius threshold, and search for candidates whose age is
around the source victim's; the search radius should be less than the
radius threshold.
- select the target segment with the most valid blocks in order to avoid
migrating the current target segment.
3. merge valid blocks from the source victim into the target victim with
the SSR allocator.
Test steps:
- create 160 dirty segments:
* half of them have 128 valid blocks per segment
* the rest have 384 valid blocks per segment
- run background GC
Benefit: both the GC count and the block movement count decrease noticeably:
- Before:
- Valid: 86
- Dirty: 1
- Prefree: 11
- Free: 6001 (6001)
GC calls: 162 (BG: 220)
- data segments : 160 (160)
- node segments : 2 (2)
Try to move 41454 blocks (BG: 41454)
- data blocks : 40960 (40960)
- node blocks : 494 (494)
IPU: 0 blocks
SSR: 0 blocks in 0 segments
LFS: 41364 blocks in 81 segments
- After:
- Valid: 87
- Dirty: 0
- Prefree: 4
- Free: 6008 (6008)
GC calls: 75 (BG: 76)
- data segments : 74 (74)
- node segments : 1 (1)
Try to move 12813 blocks (BG: 12813)
- data blocks : 12544 (12544)
- node blocks : 269 (269)
IPU: 0 blocks
SSR: 12032 blocks in 77 segments
LFS: 855 blocks in 2 segments
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix a bug along with pinfile in-mem segment & clean up]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
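Step 1 of the algorithm described in this message (selecting a source victim) can be sketched as a filter by age followed by a minimum search on valid blocks. This is a simplified user-space sketch, not the kernel's GC code: the struct, the function name, and the 0..100 age scale are assumptions for illustration, and the candidate_ratio shrinking step is omitted.

```c
#include <stddef.h>

/* Hypothetical, simplified view of a dirty segment. */
struct sketch_segment {
	unsigned int age;		/* 0 = youngest, 100 = oldest */
	unsigned int valid_blocks;	/* blocks that would need migration */
};

/*
 * Pick a source victim: among segments whose age is at or above the
 * threshold, choose the one with the fewest valid blocks. On ties the
 * earliest segment wins. Returns its index, or -1 if none qualifies.
 */
static int sketch_pick_source_victim(const struct sketch_segment *segs,
				     size_t count,
				     unsigned int age_threshold)
{
	int best = -1;

	for (size_t i = 0; i < count; i++) {
		if (segs[i].age < age_threshold)
			continue;
		if (best < 0 ||
		    segs[i].valid_blocks < segs[(size_t)best].valid_blocks)
			best = (int)i;
	}
	return best;
}
```

A young segment with very few valid blocks is skipped here, which is the behavior the first bullet of the message asks for.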
2020-08-04 21:14:49 +08:00
	if (test_opt(sbi, ATGC))
		seq_puts(seq, ",atgc");
f2fs: introduce discard_unit mount option
As James Z reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=213877
[1.] One-line summary of the problem:
Mount multiple SMR block devices exceed certain number cause system non-response
[2.] Full description of the problem/report:
Created some F2FS on SMR devices (mkfs.f2fs -m), then mounted in sequence. Each device is the same Model: HGST HSH721414AL (Size 14TB).
Empirically, found that when the amount of SMR device * 1.5Gb > System RAM, the system ran out of memory and hung. No dmesg output. For example, 24 SMR Disk need 24*1.5GB = 36GB. A system with 32G RAM can only mount 21 devices, the 22nd device will be a reproducible cause of system hang.
The number of SMR devices with other FS mounted on this system does not interfere with the result above.
[3.] Keywords (i.e., modules, networking, kernel):
F2FS, SMR, Memory
[4.] Kernel information
[4.1.] Kernel version (uname -a):
Linux 5.13.4-200.fc34.x86_64 #1 SMP Tue Jul 20 20:27:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[4.2.] Kernel .config file:
Default Fedora 34 with f2fs-tools-1.14.0-2.fc34.x86_64
[5.] Most recent kernel version which did not have the bug:
None
[6.] Output of Oops.. message (if applicable) with symbolic information
resolved (see Documentation/admin-guide/oops-tracing.rst)
None
[7.] A small shell script or example program which triggers the
problem (if possible)
mount /dev/sdX /mnt/0X
[8.] Memory consumption
With 24 * 14T SMR Block device with F2FS
free -g
              total        used        free      shared  buff/cache   available
Mem:             46          36           0           0          10          10
Swap:             0           0           0
With 3 * 14T SMR Block device with F2FS
free -g
              total        used        free      shared  buff/cache   available
Mem:              7           5           0           0           1           1
Swap:             7           0           7
The root cause is that there are three bitmaps:
- cur_valid_map
- ckpt_valid_map
- discard_map
and each of them costs ~500MB of memory. {cur, ckpt}_valid_map are
necessary, but discard_map is optional, since this bitmap is only
useful on a mountpoint where small discard is enabled.
For a blkzoned device such as an SMR or ZNS device, f2fs only issues
discard for a section (zone) when all blocks of that section are invalid,
so for such devices we don't need small discard functionality at all.
This patch introduces a new mount option "discard_unit=block|segment|
section" to support issuing discard with a different basic unit, aligned
to block, segment or section, so that the user can specify
"discard_unit=segment" or "discard_unit=section" to disable small
discard functionality.
Note that this mount option cannot be changed by remount(), because the
related metadata needs to be initialized during mount().
In order to save memory, let's use "discard_unit=section" for blkzoned
devices by default.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
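The memory figure in this message can be checked with a back-of-the-envelope calculation. The sketch below assumes one bit per 4 KiB block in each bitmap and a device of exactly 14 TiB; the function name is hypothetical and the block size must be nonzero.

```c
#include <stdint.h>

/*
 * Bytes needed for a bitmap with one bit per block, rounded up to a
 * whole byte. Assumes block_size is nonzero.
 */
static uint64_t sketch_bitmap_bytes(uint64_t device_bytes, uint32_t block_size)
{
	uint64_t blocks = device_bytes / block_size;

	return (blocks + 7) / 8;
}
```

Under these assumptions a 14 TiB device needs 469762048 bytes (448 MiB) per bitmap, which is in the same range as the ~500MB quoted above, so dropping discard_map would save roughly a third of the total.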
2021-08-03 08:15:43 +08:00
2022-06-20 10:38:42 -07:00
	if (F2FS_OPTION(sbi).memory_mode == MEMORY_MODE_NORMAL)
		seq_printf(seq, ",memory=%s", "normal");
	else if (F2FS_OPTION(sbi).memory_mode == MEMORY_MODE_LOW)
		seq_printf(seq, ",memory=%s", "low");
2012-11-02 17:07:47 +09:00
	return 0;
}
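The function that ends here emits each active option as a comma-prefixed token, so the pieces concatenate into one mount option string. A minimal user-space sketch of that pattern follows. It uses snprintf instead of the kernel's seq_file helpers, and the struct, enum, and function names are hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

enum sketch_fs_mode { SKETCH_MODE_ADAPTIVE, SKETCH_MODE_LFS };

/* Hypothetical subset of mount options, for illustration only. */
struct sketch_opts {
	int inline_data;
	int checkpoint_merge;
	enum sketch_fs_mode fs_mode;
	unsigned int active_logs;
};

/*
 * Write the options into buf as ",name" or ",name=value" tokens.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int sketch_show_options(const struct sketch_opts *o,
			       char *buf, size_t len)
{
	int n = snprintf(buf, len, "%s,mode=%s,active_logs=%u%s",
			 o->inline_data ? ",inline_data" : ",noinline_data",
			 o->fs_mode == SKETCH_MODE_LFS ? "lfs" : "adaptive",
			 o->active_logs,
			 o->checkpoint_merge ? ",checkpoint_merge"
					     : ",nocheckpoint_merge");

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Because every token carries its own leading comma, the caller never has to track whether a separator is needed, which is why boolean options are printed in both their positive and negative forms.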
2015-05-07 18:11:37 +08:00
static void default_options(struct f2fs_sb_info *sbi)
{
	/* init some FS parameters */
2021-05-21 01:32:53 -07:00
	if (f2fs_sb_has_readonly(sbi))
		F2FS_OPTION(sbi).active_logs = NR_CURSEG_RO_TYPE;
	else
		F2FS_OPTION(sbi).active_logs = NR_CURSEG_PERSIST_TYPE;
2018-03-08 14:22:56 +08:00
	F2FS_OPTION(sbi).inline_xattr_size = DEFAULT_INLINE_XATTR_ADDRS;
2022-11-15 14:35:35 +08:00
	if (le32_to_cpu(F2FS_RAW_SUPER(sbi)->segment_count_main) <=
							SMALL_VOLUME_SEGMENTS)
		F2FS_OPTION(sbi).alloc_mode = ALLOC_MODE_REUSE;
	else
		F2FS_OPTION(sbi).alloc_mode = ALLOC_MODE_DEFAULT;
2018-03-08 14:22:56 +08:00
	F2FS_OPTION(sbi).fsync_mode = FSYNC_MODE_POSIX;
2018-06-06 23:55:02 +08:00
	F2FS_OPTION(sbi).s_resuid = make_kuid(&init_user_ns, F2FS_DEF_RESUID);
	F2FS_OPTION(sbi).s_resgid = make_kgid(&init_user_ns, F2FS_DEF_RESGID);
2020-03-10 20:50:05 +08:00
	F2FS_OPTION(sbi).compress_algorithm = COMPRESS_LZ4;
2019-11-01 18:07:14 +08:00
	F2FS_OPTION(sbi).compress_log_size = MIN_COMPRESS_LOG_SIZE;
	F2FS_OPTION(sbi).compress_ext_cnt = 0;
2020-12-01 13:08:02 +09:00
	F2FS_OPTION(sbi).compress_mode = COMPR_MODE_FS;
2020-02-14 17:44:13 +08:00
	F2FS_OPTION(sbi).bggc_mode = BGGC_MODE_ON;
2022-06-20 10:38:42 -07:00
	F2FS_OPTION(sbi).memory_mode = MEMORY_MODE_NORMAL;
2015-05-07 18:11:37 +08:00
2020-07-02 01:56:06 +00:00
	sbi->sb->s_flags &= ~SB_INLINECRYPT;
2017-02-08 17:39:44 +08:00
	set_opt(sbi, INLINE_XATTR);
2015-05-07 18:11:37 +08:00
	set_opt(sbi, INLINE_DATA);
2016-05-09 19:56:34 +08:00
	set_opt(sbi, INLINE_DENTRY);
2022-11-30 09:36:43 -08:00
	set_opt(sbi, READ_EXTENT_CACHE);
2017-03-24 20:41:45 -04:00
	set_opt(sbi, NOHEAP);
2018-08-20 19:21:43 -07:00
	clear_opt(sbi, DISABLE_CHECKPOINT);
2021-04-01 17:25:20 -07:00
	set_opt(sbi, MERGE_CHECKPOINT);
2019-05-29 17:49:06 -07:00
	F2FS_OPTION(sbi).unusable_cap = 0;
2017-11-27 13:05:09 -08:00
	sbi->sb->s_flags |= SB_LAZYTIME;
2022-11-24 10:48:42 +08:00
	if (!f2fs_is_readonly(sbi))
2022-11-10 17:15:01 +08:00
		set_opt(sbi, FLUSH_MERGE);
2021-08-30 08:35:33 +08:00
	if (f2fs_hw_support_discard(sbi) || f2fs_hw_should_discard(sbi))
		set_opt(sbi, DISCARD);
2021-08-03 08:15:43 +08:00
	if (f2fs_sb_has_blkzoned(sbi)) {
2020-02-14 17:44:12 +08:00
		F2FS_OPTION(sbi).fs_mode = FS_MODE_LFS;
f2fs: introduce discard_unit mount option
As James Z reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=213877
[1.] One-line summary of the problem:
Mount multiple SMR block devices exceed certain number cause system non-response
[2.] Full description of the problem/report:
Created some F2FS on SMR devices (mkfs.f2fs -m), then mounted in sequence. Each device is the same Model: HGST HSH721414AL (Size 14TB).
Empirically, found that when the amount of SMR device * 1.5Gb > System RAM, the system ran out of memory and hung. No dmesg output. For example, 24 SMR Disk need 24*1.5GB = 36GB. A system with 32G RAM can only mount 21 devices, the 22nd device will be a reproducible cause of system hang.
The number of SMR devices with other FS mounted on this system does not interfere with the result above.
[3.] Keywords (i.e., modules, networking, kernel):
F2FS, SMR, Memory
[4.] Kernel information
[4.1.] Kernel version (uname -a):
Linux 5.13.4-200.fc34.x86_64 #1 SMP Tue Jul 20 20:27:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[4.2.] Kernel .config file:
Default Fedora 34 with f2fs-tools-1.14.0-2.fc34.x86_64
[5.] Most recent kernel version which did not have the bug:
None
[6.] Output of Oops.. message (if applicable) with symbolic information
resolved (see Documentation/admin-guide/oops-tracing.rst)
None
[7.] A small shell script or example program which triggers the
problem (if possible)
mount /dev/sdX /mnt/0X
[8.] Memory consumption
With 24 * 14T SMR Block device with F2FS
free -g
total used free shared buff/cache available
Mem: 46 36 0 0 10 10
Swap: 0 0 0
With 3 * 14T SMR Block device with F2FS
free -g
total used free shared buff/cache available
Mem: 7 5 0 0 1 1
Swap: 7 0 7
The root cause is, there are three bitmaps:
- cur_valid_map
- ckpt_valid_map
- discard_map
and each of them costs ~500MB of memory. {cur, ckpt}_valid_map are
necessary, but discard_map is optional, since this bitmap is only
useful on a mountpoint where small discard is enabled.
For a blkzoned device such as SMR or ZNS devices, f2fs will only issue
discard for a section(zone) when all blocks of that section are invalid,
so, for such device, we don't need small discard functionality at all.
This patch introduces a new mountoption "discard_unit=block|segment|
section" to support issuing discard with different basic unit which is
aligned to block, segment or section, so that user can specify
"discard_unit=segment" or "discard_unit=section" to disable small
discard functionality.
Note that this mount option cannot be changed by remount(), because the
related metadata needs to be initialized during mount().
In order to save memory, let's use "discard_unit=section" for blkzoned
device by default.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
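The per-bitmap cost quoted in the commit message above can be roughly checked with a little arithmetic. The sketch below assumes 4 KiB blocks, one bit per block in each bitmap, and a decimal 14 TB device; none of these are stated in the commit message, and bitmap_bytes() and bitmaps_total_bytes() are hypothetical helper names, not f2fs functions.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for a bitmap that holds one bit per filesystem block. */
static uint64_t bitmap_bytes(uint64_t device_bytes, uint64_t block_size)
{
	uint64_t blocks;

	assert(block_size != 0);
	blocks = device_bytes / block_size;

	return (blocks + 7) / 8;	/* round up to whole bytes */
}

/* Total cost of nr_maps equally sized bitmaps (e.g. 3 with discard_map). */
static uint64_t bitmaps_total_bytes(uint64_t device_bytes, uint64_t block_size,
				    unsigned int nr_maps)
{
	return (uint64_t)nr_maps * bitmap_bytes(device_bytes, block_size);
}
```

Under these assumptions, bitmap_bytes(14000000000000ULL, 4096) gives 427246094 bytes (about 427 MB) per bitmap, the same order of magnitude as the ~500MB in the commit message. Three bitmaps come to about 1.28 GB per device, and dropping discard_map brings that down to about 0.85 GB.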
2021-08-03 08:15:43 +08:00
F2FS_OPTION ( sbi ) . discard_unit = DISCARD_UNIT_SECTION ;
} else {
2020-02-14 17:44:12 +08:00
F2FS_OPTION ( sbi ) . fs_mode = FS_MODE_ADAPTIVE ;
f2fs: introduce discard_unit mount option
2021-08-03 08:15:43 +08:00
F2FS_OPTION ( sbi ) . discard_unit = DISCARD_UNIT_BLOCK ;
}
2015-05-07 18:11:37 +08:00
# ifdef CONFIG_F2FS_FS_XATTR
set_opt ( sbi , XATTR_USER ) ;
# endif
# ifdef CONFIG_F2FS_FS_POSIX_ACL
set_opt ( sbi , POSIX_ACL ) ;
# endif
2016-09-26 19:45:05 +08:00
2018-08-08 17:36:41 +08:00
f2fs_build_fault_attr ( sbi , 0 , 0 ) ;
2015-05-07 18:11:37 +08:00
}
2017-10-06 09:14:28 -07:00
# ifdef CONFIG_QUOTA
static int f2fs_enable_quotas ( struct super_block * sb ) ;
# endif
2018-08-20 19:21:43 -07:00
static int f2fs_disable_checkpoint ( struct f2fs_sb_info * sbi )
{
2019-01-22 14:04:33 -08:00
unsigned int s_flags = sbi - > sb - > s_flags ;
2018-08-20 19:21:43 -07:00
struct cp_control cpc ;
2022-05-07 00:28:14 +08:00
unsigned int gc_mode = sbi - > gc_mode ;
2019-01-22 14:04:33 -08:00
int err = 0 ;
int ret ;
2019-05-29 17:49:06 -07:00
block_t unusable ;
2018-08-20 19:21:43 -07:00
2019-01-22 14:04:33 -08:00
if ( s_flags & SB_RDONLY ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " checkpoint=disable on readonly fs " ) ;
2019-01-22 14:04:33 -08:00
return - EINVAL ;
}
2018-08-20 19:21:43 -07:00
sbi - > sb - > s_flags | = SB_ACTIVE ;
2022-05-07 00:28:14 +08:00
/* check if we need more GC first */
unusable = f2fs_get_unusable_blocks ( sbi ) ;
if ( ! f2fs_disable_cp_again ( sbi , unusable ) )
goto skip_gc ;
2018-08-20 19:21:43 -07:00
f2fs_update_time ( sbi , DISABLE_TIME ) ;
2022-03-18 17:02:30 +08:00
sbi - > gc_mode = GC_URGENT_HIGH ;
2018-08-20 19:21:43 -07:00
while ( ! f2fs_time_over ( sbi , DISABLE_TIME ) ) {
2022-05-06 11:40:33 -07:00
struct f2fs_gc_control gc_control = {
. victim_segno = NULL_SEGNO ,
. init_gc_type = FG_GC ,
. should_migrate_blocks = false ,
2022-05-06 13:34:41 -07:00
. err_gc_skipped = true ,
. nr_free_secs = 1 } ;
2022-05-06 11:40:33 -07:00
2022-01-07 12:48:44 -08:00
f2fs_down_write ( & sbi - > gc_lock ) ;
2022-05-06 11:40:33 -07:00
err = f2fs_gc ( sbi , & gc_control ) ;
2019-01-22 14:04:33 -08:00
if ( err = = - ENODATA ) {
err = 0 ;
2018-08-20 19:21:43 -07:00
break ;
2019-01-22 14:04:33 -08:00
}
2018-12-17 17:08:26 -08:00
if ( err & & err ! = - EAGAIN )
2019-01-22 14:04:33 -08:00
break ;
2018-08-20 19:21:43 -07:00
}
2019-01-22 14:04:33 -08:00
ret = sync_filesystem ( sbi - > sb ) ;
if ( ret | | err ) {
2021-04-06 09:47:35 +08:00
err = ret ? ret : err ;
2019-01-22 14:04:33 -08:00
goto restore_flag ;
}
2018-08-20 19:21:43 -07:00
2019-05-29 17:49:06 -07:00
unusable = f2fs_get_unusable_blocks ( sbi ) ;
if ( f2fs_disable_cp_again ( sbi , unusable ) ) {
2019-01-22 14:04:33 -08:00
err = - EAGAIN ;
goto restore_flag ;
}
2018-08-20 19:21:43 -07:00
2022-05-07 00:28:14 +08:00
skip_gc :
2022-01-07 12:48:44 -08:00
f2fs_down_write ( & sbi - > gc_lock ) ;
2018-08-20 19:21:43 -07:00
cpc . reason = CP_PAUSE ;
set_sbi_flag ( sbi , SBI_CP_DISABLED ) ;
2019-04-26 17:57:54 +08:00
err = f2fs_write_checkpoint ( sbi , & cpc ) ;
if ( err )
goto out_unlock ;
2018-08-20 19:21:43 -07:00
2019-05-05 11:40:46 +08:00
spin_lock ( & sbi - > stat_lock ) ;
2019-05-29 17:49:06 -07:00
sbi - > unusable_block_count = unusable ;
2019-05-05 11:40:46 +08:00
spin_unlock ( & sbi - > stat_lock ) ;
2019-04-26 17:57:54 +08:00
out_unlock :
2022-01-07 12:48:44 -08:00
f2fs_up_write ( & sbi - > gc_lock ) ;
2019-01-22 14:04:33 -08:00
restore_flag :
2022-03-18 17:02:30 +08:00
sbi - > gc_mode = gc_mode ;
2020-02-27 19:30:05 +08:00
sbi - > sb - > s_flags = s_flags ; /* Restore SB_RDONLY status */
2019-01-22 14:04:33 -08:00
return err ;
2018-08-20 19:21:43 -07:00
}
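The GC loop in f2fs_disable_checkpoint() above treats its return codes differently: -ENODATA means there is nothing left to collect and counts as success, -EAGAIN means keep trying, and any other error aborts. A minimal user-space sketch of that control flow follows. All names here (gc_step_fn, run_gc_until_done, scripted_step, first_error) are hypothetical, and a round budget stands in for the DISABLE_TIME deadline that the kernel code checks with f2fs_time_over().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one f2fs_gc() call: 0 or a negative errno. */
typedef int (*gc_step_fn)(void *ctx);

/* Loop until -ENODATA (done), a hard error, or the round budget runs out. */
static int run_gc_until_done(gc_step_fn gc_step, void *ctx,
			     unsigned int max_rounds)
{
	unsigned int i;
	int err = 0;

	assert(gc_step != NULL);
	for (i = 0; i < max_rounds; i++) {
		err = gc_step(ctx);
		if (err == -ENODATA) {	/* nothing left to collect */
			err = 0;
			break;
		}
		if (err && err != -EAGAIN)	/* hard error: give up */
			break;
	}
	return err;	/* may still be -EAGAIN if the budget ran out */
}

/* Mirrors "err = ret ? ret : err": a sync failure wins over a GC error. */
static int first_error(int ret, int err)
{
	return ret ? ret : err;
}

/* Example step: replays scripted return codes, then reports -ENODATA. */
struct script {
	const int *codes;
	size_t len;
	size_t pos;
};

static int scripted_step(void *ctx)
{
	struct script *s = ctx;

	return s->pos < s->len ? s->codes[s->pos++] : -ENODATA;
}
```

With a script of { 0, -EAGAIN, 0 } the loop consumes all three codes and returns 0 once the step reports -ENODATA. With { 0, -EIO, 0 } it stops at the second call and returns -EIO.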
static void f2fs_enable_checkpoint ( struct f2fs_sb_info * sbi )
{
2021-08-19 14:00:57 -07:00
int retry = DEFAULT_RETRY_IO_COUNT ;
2021-01-26 17:00:42 -08:00
/* we should flush all the data to keep data consistency */
2021-08-19 14:00:57 -07:00
do {
sync_inodes_sb ( sbi - > sb ) ;
2022-03-22 14:39:13 -07:00
f2fs_io_schedule_timeout ( DEFAULT_IO_TIMEOUT ) ;
2021-08-19 14:00:57 -07:00
} while ( get_pages ( sbi , F2FS_DIRTY_DATA ) & & retry - - ) ;
if ( unlikely ( retry < 0 ) )
f2fs_warn ( sbi , " checkpoint=enable has some unwritten data. " ) ;
2021-01-26 17:00:42 -08:00
2022-01-07 12:48:44 -08:00
f2fs_down_write ( & sbi - > gc_lock ) ;
2018-08-20 19:21:43 -07:00
f2fs_dirty_to_prefree ( sbi ) ;
clear_sbi_flag ( sbi , SBI_CP_DISABLED ) ;
set_sbi_flag ( sbi , SBI_IS_DIRTY ) ;
2022-01-07 12:48:44 -08:00
f2fs_up_write ( & sbi - > gc_lock ) ;
2018-08-20 19:21:43 -07:00
f2fs_sync_fs ( sbi - > sb , 1 ) ;
2022-08-18 22:40:09 -07:00
/* Let's ensure there's no pending checkpoint anymore */
f2fs_flush_ckpt_thread ( sbi ) ;
2018-08-20 19:21:43 -07:00
}
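The do/while loop in f2fs_enable_checkpoint() above relies on a post-decrement: retry only drops below zero when dirty data remains after the last permitted attempt, which is what the "retry < 0" warning tests for. The following user-space sketch illustrates that counting behaviour. struct flush_sim, sim_sync() and flush_with_retry() are hypothetical, and the simulated sync simply subtracts a fixed number of pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a filesystem with some dirty pages left to write back. */
struct flush_sim {
	int dirty_pages;	/* pages still dirty */
	int pages_per_sync;	/* pages written by each simulated sync */
	int syncs;		/* number of sync attempts made */
};

static void sim_sync(struct flush_sim *s)
{
	s->syncs++;
	s->dirty_pages -= s->pages_per_sync;
	if (s->dirty_pages < 0)
		s->dirty_pages = 0;
}

/* Returns true if everything was flushed, false if retries ran out. */
static bool flush_with_retry(struct flush_sim *s, int retry)
{
	assert(retry >= 0);
	do {
		sim_sync(s);
		/* retry-- is only evaluated while pages are still dirty */
	} while (s->dirty_pages && retry--);

	return retry >= 0;
}
```

Starting with retry = 3 and a sync that never makes progress, the loop runs four times and leaves retry at -1, so the function reports failure. If the data is clean after the second sync, the short-circuit skips the decrement and retry stays at 2.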
2013-06-16 09:48:48 +09:00
static int f2fs_remount ( struct super_block * sb , int * flags , char * data )
{
struct f2fs_sb_info * sbi = F2FS_SB ( sb ) ;
struct f2fs_mount_info org_mount_opt ;
2017-07-09 00:13:07 +08:00
unsigned long old_sb_flags ;
2018-03-08 14:22:56 +08:00
int err ;
2021-03-17 17:56:04 +08:00
bool need_restart_gc = false , need_stop_gc = false ;
bool need_restart_ckpt = false , need_stop_ckpt = false ;
bool need_restart_flush = false , need_stop_flush = false ;
2021-08-19 16:02:37 +08:00
bool need_restart_discard = false , need_stop_discard = false ;
2022-11-30 09:36:43 -08:00
bool no_read_extent_cache = ! test_opt ( sbi , READ_EXTENT_CACHE ) ;
2022-12-01 17:37:15 -08:00
bool no_age_extent_cache = ! test_opt ( sbi , AGE_EXTENT_CACHE ) ;
2021-07-29 09:22:17 +08:00
bool enable_checkpoint = ! test_opt ( sbi , DISABLE_CHECKPOINT ) ;
2019-07-12 16:57:00 +08:00
bool no_io_align = ! F2FS_IO_ALIGNED ( sbi ) ;
f2fs: support age threshold based garbage collection
There are several issues in current background GC algorithm:
- valid blocks is one of key factors during cost overhead calculation,
so if segment has less valid block, however even its age is young or
it locates hot segment, CB algorithm will still choose the segment as
victim, it's not appropriate.
- GCed data/node will go to existing logs, no matter in-there datas'
update frequency is the same or not, it may mix hot and cold data
again.
- GC allocator mainly uses LFS type segments, which consumes free segments
more quickly.
This patch introduces a new algorithm named age threshold based
garbage collection to solve above issues, there are three steps
mainly:
1. select a source victim:
- set an age threshold, and select candidates based on the threshold:
e.g.
0 means youngest, 100 means oldest, if we set age threshold to 80
then select dirty segments whose age is in the range of [80, 100] as
candidates;
- set candidate_ratio threshold, and select candidates based the
ratio, so that we can shrink candidates to those oldest segments;
- select target segment with fewest valid blocks in order to
migrate blocks with minimum cost;
2. select a target victim:
- select candidates based on the age threshold;
- set candidate_radius threshold, search candidates whose age is
around source victims, searching radius should less than the
radius threshold.
- select target segment with most valid blocks in order to avoid
migrating current target segment.
3. merge valid blocks from source victim into target victim with
SSR allocator.
Test steps:
- create 160 dirty segments:
* half of them have 128 valid blocks per segment
* left of them have 384 valid blocks per segment
- run background GC
Benefit: GC count and block movement count both decrease obviously:
- Before:
- Valid: 86
- Dirty: 1
- Prefree: 11
- Free: 6001 (6001)
GC calls: 162 (BG: 220)
- data segments : 160 (160)
- node segments : 2 (2)
Try to move 41454 blocks (BG: 41454)
- data blocks : 40960 (40960)
- node blocks : 494 (494)
IPU: 0 blocks
SSR: 0 blocks in 0 segments
LFS: 41364 blocks in 81 segments
- After:
- Valid: 87
- Dirty: 0
- Prefree: 4
- Free: 6008 (6008)
GC calls: 75 (BG: 76)
- data segments : 74 (74)
- node segments : 1 (1)
Try to move 12813 blocks (BG: 12813)
- data blocks : 12544 (12544)
- node blocks : 269 (269)
IPU: 0 blocks
SSR: 12032 blocks in 77 segments
LFS: 855 blocks in 2 segments
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix a bug along with pinfile in-mem segment & clean up]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
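Step 1 of the algorithm described above (source victim selection) can be sketched in user-space C. This is a simplified illustration rather than the kernel's implementation: struct seg_info and pick_source_victim() are hypothetical names, ages are assumed to be already normalized to 0..100, and the candidate_ratio filtering is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-segment summary used only for this illustration. */
struct seg_info {
	unsigned int age;		/* 0 = youngest, 100 = oldest */
	unsigned int valid_blocks;	/* blocks that would need migrating */
};

/*
 * Among segments whose age is at least age_threshold, pick the one with
 * the fewest valid blocks (cheapest to migrate). Returns its index, or
 * -1 if no segment is old enough. Ties go to the lowest index.
 */
static long pick_source_victim(const struct seg_info *segs, size_t n,
			       unsigned int age_threshold)
{
	long best = -1;
	size_t i;

	assert(segs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (segs[i].age < age_threshold)
			continue;	/* too young to be a candidate */
		if (best < 0 || segs[i].valid_blocks < segs[best].valid_blocks)
			best = (long)i;
	}
	return best;
}
```

With a threshold of 80, a young segment is skipped even if it holds very few valid blocks, which is the behaviour the commit message contrasts with the cost-benefit (CB) algorithm.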
2020-08-04 21:14:49 +08:00
bool no_atgc = ! test_opt ( sbi , ATGC ) ;
2021-08-19 16:02:37 +08:00
bool no_discard = ! test_opt ( sbi , DISCARD ) ;
2021-05-20 19:51:50 +08:00
bool no_compress_cache = ! test_opt ( sbi , COMPRESS_CACHE ) ;
f2fs: introduce discard_unit mount option
2021-08-03 08:15:43 +08:00
bool block_unit_discard = f2fs_block_unit_discard ( sbi ) ;
2017-08-08 10:54:31 +08:00
# ifdef CONFIG_QUOTA
int i , j ;
# endif
2013-06-16 09:48:48 +09:00
/*
* Save the old mount options in case we
* need to restore them .
*/
org_mount_opt = sbi - > mount_opt ;
2017-07-09 00:13:07 +08:00
old_sb_flags = sb - > s_flags ;
2013-06-16 09:48:48 +09:00
2017-08-08 10:54:31 +08:00
# ifdef CONFIG_QUOTA
2018-03-08 14:22:56 +08:00
org_mount_opt . s_jquota_fmt = F2FS_OPTION ( sbi ) . s_jquota_fmt ;
2017-08-08 10:54:31 +08:00
for ( i = 0 ; i < MAXQUOTAS ; i + + ) {
2018-03-08 14:22:56 +08:00
if ( F2FS_OPTION ( sbi ) . s_qf_names [ i ] ) {
org_mount_opt . s_qf_names [ i ] =
kstrdup ( F2FS_OPTION ( sbi ) . s_qf_names [ i ] ,
GFP_KERNEL ) ;
if ( ! org_mount_opt . s_qf_names [ i ] ) {
2017-08-08 10:54:31 +08:00
for ( j = 0 ; j < i ; j + + )
2020-06-17 20:30:12 +08:00
kfree ( org_mount_opt . s_qf_names [ j ] ) ;
2017-08-08 10:54:31 +08:00
return - ENOMEM ;
}
} else {
2018-03-08 14:22:56 +08:00
org_mount_opt . s_qf_names [ i ] = NULL ;
2017-08-08 10:54:31 +08:00
}
}
# endif
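The quota-name backup loop above duplicates each name and, if one allocation fails, frees the copies made so far before returning -ENOMEM. A user-space sketch of the same copy-with-rollback pattern follows. NR_NAMES, struct dup_budget, budget_dup() and save_names() are hypothetical. The budget only exists to simulate allocation failure, and unlike the kernel code the sketch also resets rolled-back slots to NULL so that the result is easy to inspect.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define NR_NAMES 3	/* hypothetical stand-in for MAXQUOTAS */

/* Allows a fixed number of successful duplications, then fails. */
struct dup_budget {
	int remaining;
};

static char *budget_dup(const char *s, struct dup_budget *b)
{
	size_t len = strlen(s) + 1;
	char *copy;

	if (b->remaining <= 0)
		return NULL;	/* simulated allocation failure */
	b->remaining--;
	copy = malloc(len);
	if (copy)
		memcpy(copy, s, len);
	return copy;
}

/* Copy every non-NULL name; on failure, free earlier copies and bail out. */
static int save_names(char *dst[NR_NAMES], char *const src[NR_NAMES],
		      struct dup_budget *b)
{
	int i, j;

	assert(b != NULL);
	for (i = 0; i < NR_NAMES; i++) {
		if (!src[i]) {
			dst[i] = NULL;
			continue;
		}
		dst[i] = budget_dup(src[i], b);
		if (!dst[i]) {
			for (j = 0; j < i; j++) {
				free(dst[j]);	/* free(NULL) is a no-op */
				dst[j] = NULL;
			}
			return -ENOMEM;
		}
	}
	return 0;
}
```

The rollback loop only walks indices below i, because slot i itself never received a valid pointer.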
2016-03-23 17:05:27 -07:00
/* recover superblocks we couldn't write due to previous RO mount */
2017-11-27 13:05:09 -08:00
if ( ! ( * flags & SB_RDONLY ) & & is_sbi_flag_set ( sbi , SBI_NEED_SB_WRITE ) ) {
2016-03-23 17:05:27 -07:00
err = f2fs_commit_super ( sbi , false ) ;
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Try to recover all the superblocks, ret: %d " ,
err ) ;
2016-03-23 17:05:27 -07:00
if ( ! err )
clear_sbi_flag ( sbi , SBI_NEED_SB_WRITE ) ;
}
2015-05-07 18:11:37 +08:00
default_options ( sbi ) ;
2014-09-15 18:04:44 +08:00
2013-06-16 09:48:48 +09:00
/* parse mount options */
fscrypt: support test_dummy_encryption=v2
v1 encryption policies are deprecated in favor of v2, and some new
features (e.g. encryption+casefolding) are only being added for v2.
Therefore, the "test_dummy_encryption" mount option (which is used for
encryption I/O testing with xfstests) needs to support v2 policies.
To do this, extend its syntax to be "test_dummy_encryption=v1" or
"test_dummy_encryption=v2". The existing "test_dummy_encryption" (no
argument) also continues to be accepted, to specify the default setting
-- currently v1, but the next patch changes it to v2.
To cleanly support both v1 and v2 while also making it easy to support
specifying other encryption settings in the future (say, accepting
"$contents_mode:$filenames_mode:v2"), make ext4 and f2fs maintain a
pointer to the dummy fscrypt_context rather than using mount flags.
To avoid concurrency issues, don't allow test_dummy_encryption to be set
or changed during a remount. (The former restriction is new, but
xfstests doesn't run into it, so no one should notice.)
Tested with 'gce-xfstests -c {ext4,f2fs}/encrypt -g auto'. On ext4,
there are two regressions, both of which are test bugs: ext4/023 and
ext4/028 fail because they set an xattr and expect it to be stored
inline, but the increase in size of the fscrypt_context from
24 to 40 bytes causes this xattr to be spilled into an external block.
Link: https://lore.kernel.org/r/20200512233251.118314-4-ebiggers@kernel.org
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-12 16:32:50 -07:00
err = parse_options ( sb , data , true ) ;
2013-06-16 09:48:48 +09:00
if ( err )
goto restore_opts ;
/*
* Previous and new state of filesystem is RO ,
2014-04-11 17:50:00 +08:00
* so skip checking GC and FLUSH_MERGE conditions .
2013-06-16 09:48:48 +09:00
*/
2017-11-27 13:05:09 -08:00
if ( f2fs_readonly ( sb ) & & ( * flags & SB_RDONLY ) )
2013-06-16 09:48:48 +09:00
goto skip ;
2021-05-21 01:32:53 -07:00
if ( f2fs_sb_has_readonly ( sbi ) & & ! ( * flags & SB_RDONLY ) ) {
err = - EROFS ;
goto restore_opts ;
}
2017-10-06 09:14:28 -07:00
# ifdef CONFIG_QUOTA
2017-11-27 13:05:09 -08:00
if ( ! f2fs_readonly ( sb ) & & ( * flags & SB_RDONLY ) ) {
2017-07-09 00:13:07 +08:00
err = dquot_suspend ( sb , - 1 ) ;
if ( err < 0 )
goto restore_opts ;
2018-08-20 19:21:43 -07:00
} else if ( f2fs_readonly ( sb ) & & ! ( * flags & SB_RDONLY ) ) {
2017-07-09 00:13:07 +08:00
/* dquot_resume needs RW */
2017-11-27 13:05:09 -08:00
sb - > s_flags & = ~ SB_RDONLY ;
2017-10-06 09:14:28 -07:00
if ( sb_any_quota_suspended ( sb ) ) {
dquot_resume ( sb , - 1 ) ;
2018-10-24 18:34:26 +08:00
} else if ( f2fs_sb_has_quota_ino ( sbi ) ) {
2017-10-06 09:14:28 -07:00
err = f2fs_enable_quotas ( sb ) ;
if ( err )
goto restore_opts ;
}
2017-07-09 00:13:07 +08:00
}
2017-10-06 09:14:28 -07:00
# endif
2023-02-06 22:43:08 +08:00
if ( f2fs_lfs_mode ( sbi ) & & ! IS_F2FS_IPU_DISABLE ( sbi ) ) {
err = - EINVAL ;
f2fs_warn ( sbi , " LFS is not compatible with IPU " ) ;
goto restore_opts ;
}
f2fs: support age threshold based garbage collection
2020-08-04 21:14:49 +08:00
/* disallow enabling/disabling atgc dynamically */
if ( no_atgc = = ! ! test_opt ( sbi , ATGC ) ) {
err = - EINVAL ;
f2fs_warn ( sbi , " switch atgc option is not allowed " ) ;
goto restore_opts ;
}
2015-09-18 16:55:26 +08:00
/* disallow enable/disable extent_cache dynamically */
2022-11-30 09:36:43 -08:00
if ( no_read_extent_cache = = ! ! test_opt ( sbi , READ_EXTENT_CACHE ) ) {
2015-09-18 16:55:26 +08:00
err = - EINVAL ;
2019-06-18 17:48:42 +08:00
f2fs_warn ( sbi , " switch extent_cache option is not allowed " ) ;
2015-09-18 16:55:26 +08:00
goto restore_opts ;
}
2022-12-01 17:37:15 -08:00
/* disallow enable/disable age extent_cache dynamically */
if ( no_age_extent_cache = = ! ! test_opt ( sbi , AGE_EXTENT_CACHE ) ) {
err = - EINVAL ;
f2fs_warn ( sbi , " switch age_extent_cache option is not allowed " ) ;
goto restore_opts ;
}
2015-09-18 16:55:26 +08:00
2019-07-12 16:57:00 +08:00
if ( no_io_align = = ! ! F2FS_IO_ALIGNED ( sbi ) ) {
err = - EINVAL ;
f2fs_warn ( sbi , " switch io_bits option is not allowed " ) ;
goto restore_opts ;
}
2021-05-20 19:51:50 +08:00
if ( no_compress_cache = = ! ! test_opt ( sbi , COMPRESS_CACHE ) ) {
err = - EINVAL ;
f2fs_warn ( sbi , " switch compress_cache option is not allowed " ) ;
goto restore_opts ;
}
f2fs: introduce discard_unit mount option
2021-08-03 08:15:43 +08:00
if ( block_unit_discard ! = f2fs_block_unit_discard ( sbi ) ) {
err = - EINVAL ;
f2fs_warn ( sbi , " switch discard_unit option is not allowed " ) ;
goto restore_opts ;
}
2018-08-20 19:21:43 -07:00
if ( ( * flags & SB_RDONLY ) & & test_opt ( sbi , DISABLE_CHECKPOINT ) ) {
err = - EINVAL ;
2019-06-18 17:48:42 +08:00
f2fs_warn ( sbi , " disabling checkpoint not compatible with read-only " ) ;
2018-08-20 19:21:43 -07:00
goto restore_opts ;
}
2013-06-16 09:48:48 +09:00
/*
* We stop the GC thread if FS is mounted as RO
* or if background_gc = off is passed in mount
* option . Also sync the filesystem .
*/
2020-02-14 17:44:13 +08:00
if ( ( * flags & SB_RDONLY ) | |
2021-03-27 17:57:06 +08:00
( F2FS_OPTION ( sbi ) . bggc_mode = = BGGC_MODE_OFF & &
! test_opt ( sbi , GC_MERGE ) ) ) {
2013-06-16 09:48:48 +09:00
if ( sbi - > gc_thread ) {
f2fs: clean up symbol namespace
As Ted reported:
"Hi, I was looking at f2fs's sources recently, and I noticed that there
is a very large number of non-static symbols which don't have a f2fs
prefix. There's well over a hundred (see attached below).
As one example, in fs/f2fs/dir.c there is:
unsigned char get_de_type(struct f2fs_dir_entry *de)
This function is clearly only useful for f2fs, but it has a generic
name. This means that if any other file system tries to have the same
symbol name, there will be a symbol conflict and the kernel would not
successfully build. It also means that when someone is looking f2fs
sources, it's not at all obvious whether a function such as
read_data_page(), invalidate_blocks(), is a generic kernel function
found in the fs, mm, or block layers, or a f2fs specific function.
You might want to fix this at some point. Hopefully Kent's bcachefs
isn't similarly using generically named functions, since that might
cause conflicts with f2fs's functions --- but just as this would be a
problem that we would rightly insist that Kent fix, this is something
that we should have rightly insisted that f2fs should have fixed
before it was integrated into the mainline kernel.
acquire_orphan_inode
add_ino_entry
add_orphan_inode
allocate_data_block
allocate_new_segments
alloc_nid
alloc_nid_done
alloc_nid_failed
available_free_memory
...."
This patch adds "f2fs_" prefix for all non-static symbols in order to:
a) avoid conflict with other kernel generic symbols;
b) to indicate the function is f2fs specific one instead of generic
one;
Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 00:20:41 +08:00
f2fs_stop_gc_thread ( sbi ) ;
2014-04-11 17:50:00 +08:00
need_restart_gc = true ;
2013-06-16 09:48:48 +09:00
}
2014-11-18 11:17:20 +08:00
} else if ( ! sbi - > gc_thread ) {
f2fs: clean up symbol namespace
2018-05-30 00:20:41 +08:00
err = f2fs_start_gc_thread ( sbi ) ;
2013-06-16 09:48:48 +09:00
if ( err )
goto restore_opts ;
2014-04-11 17:50:00 +08:00
need_stop_gc = true ;
}
2022-04-12 14:45:50 -07:00
if ( * flags & SB_RDONLY ) {
2016-03-24 10:29:39 -07:00
sync_inodes_sb ( sb ) ;
set_sbi_flag ( sbi , SBI_IS_DIRTY ) ;
set_sbi_flag ( sbi , SBI_IS_CLOSE ) ;
f2fs_sync_fs ( sb , 1 ) ;
clear_sbi_flag ( sbi , SBI_IS_CLOSE ) ;
}
2021-03-17 17:56:03 +08:00
if ( ( * flags & SB_RDONLY ) | | test_opt ( sbi , DISABLE_CHECKPOINT ) | |
! test_opt ( sbi , MERGE_CHECKPOINT ) ) {
f2fs_stop_ckpt_thread ( sbi ) ;
2021-03-17 17:56:04 +08:00
need_restart_ckpt = true ;
2021-03-17 17:56:03 +08:00
} else {
2022-08-18 22:40:09 -07:00
/* Flush the previous checkpoint, if it exists. */
f2fs_flush_ckpt_thread ( sbi ) ;
f2fs: introduce checkpoint_merge mount option
We've added new mount options, "checkpoint_merge" and "nocheckpoint_merge",
which creates a kernel daemon and makes it to merge concurrent checkpoint
requests as much as possible to eliminate redundant checkpoint issues. Plus,
we can eliminate the sluggish issue caused by slow checkpoint operation
when the checkpoint is done in a process context in a cgroup having
low i/o budget and cpu shares. To make this do better, we set the
default i/o priority of the kernel daemon to "3", to give one higher
priority than other kernel threads. The below verification result
explains this.
The basic idea has come from https://opensource.samsung.com.
[Verification]
Android Pixel Device(ARM64, 7GB RAM, 256GB UFS)
Create two I/O cgroups (fg w/ weight 100, bg w/ weight 20)
Set "strict_guarantees" to "1" in BFQ tunables
In "fg" cgroup,
- thread A => trigger 1000 checkpoint operations
"for i in `seq 1 1000`; do touch test_dir1/file; fsync test_dir1;
done"
- thread B => generating async. I/O
"fio --rw=write --numjobs=1 --bs=128k --runtime=3600 --time_based=1
--filename=test_img --name=test"
In "bg" cgroup,
- thread C => trigger repeated checkpoint operations
"echo $$ > /dev/blkio/bg/tasks; while true; do touch test_dir2/file;
fsync test_dir2; done"
We've measured thread A's execution time.
[ w/o patch ]
Elapsed Time: Avg. 68 seconds
[ w/ patch ]
Elapsed Time: Avg. 48 seconds
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
[Jaegeuk Kim: fix the return value in f2fs_start_ckpt_thread, reported by Dan]
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-01-19 09:00:42 +09:00
err = f2fs_start_ckpt_thread ( sbi ) ;
if ( err ) {
f2fs_err ( sbi ,
" Failed to start F2FS issue_checkpoint_thread (%d) " ,
err ) ;
goto restore_gc ;
}
2021-03-17 17:56:04 +08:00
need_stop_ckpt = true ;
2021-01-19 09:00:42 +09:00
}
2014-04-11 17:50:00 +08:00
/*
* We stop issue flush thread if FS is mounted as RO
* or if flush_merge is not passed in mount option .
*/
2017-11-27 13:05:09 -08:00
if ( ( * flags & SB_RDONLY ) | | ! test_opt ( sbi , FLUSH_MERGE ) ) {
2016-12-07 16:23:32 -08:00
clear_opt ( sbi , FLUSH_MERGE ) ;
f2fs: clean up symbol namespace
As Ted reported:
"Hi, I was looking at f2fs's sources recently, and I noticed that there
is a very large number of non-static symbols which don't have a f2fs
prefix. There's well over a hundred (see attached below).
As one example, in fs/f2fs/dir.c there is:
unsigned char get_de_type(struct f2fs_dir_entry *de)
This function is clearly only useful for f2fs, but it has a generic
name. This means that if any other file system tries to have the same
symbol name, there will be a symbol conflict and the kernel would not
successfully build. It also means that when someone is looking at f2fs
sources, it's not at all obvious whether a function such as
read_data_page(), invalidate_blocks(), is a generic kernel function
found in the fs, mm, or block layers, or a f2fs specific function.
You might want to fix this at some point. Hopefully Kent's bcachefs
isn't similarly using generically named functions, since that might
cause conflicts with f2fs's functions --- but just as this would be a
problem that we would rightly insist that Kent fix, this is something
that we should have rightly insisted that f2fs should have fixed
before it was integrated into the mainline kernel.
acquire_orphan_inode
add_ino_entry
add_orphan_inode
allocate_data_block
allocate_new_segments
alloc_nid
alloc_nid_done
alloc_nid_failed
available_free_memory
...."
This patch adds the "f2fs_" prefix to all non-static symbols in order to:
a) avoid conflicts with other generic kernel symbols;
b) indicate that a function is f2fs-specific rather than generic.
Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 00:20:41 +08:00
f2fs_destroy_flush_cmd_control ( sbi , false ) ;
2021-03-17 17:56:04 +08:00
need_restart_flush = true ;
2016-12-07 16:23:32 -08:00
} else {
2018-05-30 00:20:41 +08:00
err = f2fs_create_flush_cmd_control ( sbi ) ;
2014-04-27 14:21:33 +08:00
if ( err )
2021-03-17 17:56:04 +08:00
goto restore_ckpt ;
need_stop_flush = true ;
2013-06-16 09:48:48 +09:00
}
2021-03-17 17:56:04 +08:00
2021-08-19 16:02:37 +08:00
if ( no_discard = = ! ! test_opt ( sbi , DISCARD ) ) {
if ( test_opt ( sbi , DISCARD ) ) {
err = f2fs_start_discard_thread ( sbi ) ;
if ( err )
goto restore_flush ;
need_stop_discard = true ;
} else {
f2fs_stop_discard_thread ( sbi ) ;
2022-12-02 12:58:41 +08:00
f2fs_issue_discard_timeout ( sbi ) ;
2021-08-19 16:02:37 +08:00
need_restart_discard = true ;
}
}
2021-07-29 09:22:17 +08:00
if ( enable_checkpoint = = ! ! test_opt ( sbi , DISABLE_CHECKPOINT ) ) {
2021-03-17 17:56:04 +08:00
if ( test_opt ( sbi , DISABLE_CHECKPOINT ) ) {
err = f2fs_disable_checkpoint ( sbi ) ;
if ( err )
2021-08-19 16:02:37 +08:00
goto restore_discard ;
2021-03-17 17:56:04 +08:00
} else {
f2fs_enable_checkpoint ( sbi ) ;
}
}
2013-06-16 09:48:48 +09:00
skip :
2017-08-08 10:54:31 +08:00
# ifdef CONFIG_QUOTA
/* Release old quota file names */
for ( i = 0 ; i < MAXQUOTAS ; i + + )
2020-06-17 20:30:12 +08:00
kfree ( org_mount_opt . s_qf_names [ i ] ) ;
2017-08-08 10:54:31 +08:00
# endif
2013-06-16 09:48:48 +09:00
/* Update the POSIXACL Flag */
2017-11-27 13:05:09 -08:00
sb - > s_flags = ( sb - > s_flags & ~ SB_POSIXACL ) |
( test_opt ( sbi , POSIX_ACL ) ? SB_POSIXACL : 0 ) ;
2016-03-23 17:05:27 -07:00
2017-12-27 15:05:52 -08:00
limit_reserve_root ( sbi ) ;
2020-05-15 17:20:50 -07:00
adjust_unusable_cap_perc ( sbi ) ;
2018-09-28 00:24:39 -07:00
* flags = ( * flags & ~ SB_LAZYTIME ) | ( sb - > s_flags & SB_LAZYTIME ) ;
2013-06-16 09:48:48 +09:00
return 0 ;
2021-08-19 16:02:37 +08:00
restore_discard :
if ( need_restart_discard ) {
if ( f2fs_start_discard_thread ( sbi ) )
f2fs_warn ( sbi , " discard has been stopped " ) ;
} else if ( need_stop_discard ) {
f2fs_stop_discard_thread ( sbi ) ;
}
2021-03-17 17:56:04 +08:00
restore_flush :
if ( need_restart_flush ) {
if ( f2fs_create_flush_cmd_control ( sbi ) )
f2fs_warn ( sbi , " background flush thread has stopped " ) ;
} else if ( need_stop_flush ) {
clear_opt ( sbi , FLUSH_MERGE ) ;
f2fs_destroy_flush_cmd_control ( sbi , false ) ;
}
restore_ckpt :
if ( need_restart_ckpt ) {
if ( f2fs_start_ckpt_thread ( sbi ) )
f2fs_warn ( sbi , " background ckpt thread has stopped " ) ;
} else if ( need_stop_ckpt ) {
f2fs_stop_ckpt_thread ( sbi ) ;
}
2014-04-11 17:50:00 +08:00
restore_gc :
if ( need_restart_gc ) {
2018-05-30 00:20:41 +08:00
if ( f2fs_start_gc_thread ( sbi ) )
2019-06-18 17:48:42 +08:00
f2fs_warn ( sbi , " background gc thread has stopped " ) ;
2014-04-11 17:50:00 +08:00
} else if ( need_stop_gc ) {
2018-05-30 00:20:41 +08:00
f2fs_stop_gc_thread ( sbi ) ;
2014-04-11 17:50:00 +08:00
}
2013-06-16 09:48:48 +09:00
restore_opts :
2017-08-08 10:54:31 +08:00
# ifdef CONFIG_QUOTA
2018-03-08 14:22:56 +08:00
F2FS_OPTION ( sbi ) . s_jquota_fmt = org_mount_opt . s_jquota_fmt ;
2017-08-08 10:54:31 +08:00
for ( i = 0 ; i < MAXQUOTAS ; i + + ) {
2020-06-17 20:30:12 +08:00
kfree ( F2FS_OPTION ( sbi ) . s_qf_names [ i ] ) ;
2018-03-08 14:22:56 +08:00
F2FS_OPTION ( sbi ) . s_qf_names [ i ] = org_mount_opt . s_qf_names [ i ] ;
2017-08-08 10:54:31 +08:00
}
# endif
2013-06-16 09:48:48 +09:00
sbi - > mount_opt = org_mount_opt ;
2017-07-09 00:13:07 +08:00
sb - > s_flags = old_sb_flags ;
2013-06-16 09:48:48 +09:00
return err ;
}
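The error path above unwinds in reverse order through the restore_* labels, using the need_restart_* and need_stop_* flags to undo only what this call changed. A minimal userspace sketch of that pattern follows. It is not the f2fs code: sketch_state, sketch_start() and sketch_reconfigure() are hypothetical names, and the two "threads" are plain booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-superblock thread state. */
struct sketch_state {
	bool gc_running;
	bool ckpt_running;
};

/* Hypothetical start helper: fails with -1 when 'fail' is set. */
static int sketch_start(bool *running, bool fail)
{
	if (fail)
		return -1;
	*running = true;
	return 0;
}

/*
 * Apply a new configuration. If a later step fails, earlier steps are
 * rolled back so the state matches what it was on entry.
 */
static int sketch_reconfigure(struct sketch_state *s, bool want_gc,
			      bool want_ckpt, bool fail_ckpt)
{
	bool need_restart_gc = false, need_stop_gc = false;
	int err = 0;

	assert(s);

	if (!want_gc && s->gc_running) {
		s->gc_running = false;		/* stopped here: restart on error */
		need_restart_gc = true;
	} else if (want_gc && !s->gc_running) {
		err = sketch_start(&s->gc_running, false);
		if (err)
			goto out;
		need_stop_gc = true;		/* started here: stop on error */
	}

	if (want_ckpt && !s->ckpt_running) {
		err = sketch_start(&s->ckpt_running, fail_ckpt);
		if (err)
			goto restore_gc;
	} else if (!want_ckpt) {
		s->ckpt_running = false;
	}
	return 0;

restore_gc:
	if (need_restart_gc)
		s->gc_running = true;
	else if (need_stop_gc)
		s->gc_running = false;
out:
	return err;
}
```

Recording what was changed, rather than what the final state should be, keeps each restore label independent of the steps that come after it.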
2017-07-09 00:13:07 +08:00
# ifdef CONFIG_QUOTA
/* Read data from quotafile */
static ssize_t f2fs_quota_read ( struct super_block * sb , int type , char * data ,
size_t len , loff_t off )
{
struct inode * inode = sb_dqopt ( sb ) - > files [ type ] ;
struct address_space * mapping = inode - > i_mapping ;
block_t blkidx = F2FS_BYTES_TO_BLK ( off ) ;
int offset = off & ( sb - > s_blocksize - 1 ) ;
int tocopy ;
size_t toread ;
loff_t i_size = i_size_read ( inode ) ;
struct page * page ;
if ( off > i_size )
return 0 ;
if ( off + len > i_size )
len = i_size - off ;
toread = len ;
while ( toread > 0 ) {
tocopy = min_t ( unsigned long , sb - > s_blocksize - offset , toread ) ;
repeat :
2018-03-16 18:53:53 +05:30
page = read_cache_page_gfp ( mapping , blkidx , GFP_NOFS ) ;
2017-10-19 09:43:56 -07:00
if ( IS_ERR ( page ) ) {
if ( PTR_ERR ( page ) = = - ENOMEM ) {
mm: introduce memalloc_retry_wait()
Various places in the kernel - largely in filesystems - respond to a
memory allocation failure by looping around and re-trying. Some of
these cannot conveniently use __GFP_NOFAIL, for reasons such as:
- a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
- a need to check for the process being signalled between failures
- the possibility that other recovery actions could be performed
- the allocation is quite deep in support code, and passing down an
extra flag to say if __GFP_NOFAIL is wanted would be clumsy.
Many of these currently use congestion_wait() which (in almost all
cases) simply waits the given timeout - congestion isn't tracked for
most devices.
It isn't clear what the best delay is for loops, but it is clear that
the various filesystems shouldn't be responsible for choosing a timeout.
This patch introduces memalloc_retry_wait() which takes on that
responsibility. Code that wants to retry a memory allocation can call
this function passing the GFP flags that were used. It will wait
for however long is appropriate.
For now, it only considers __GFP_NORETRY and whatever
gfpflags_allow_blocking() tests. If blocking is allowed without
__GFP_NORETRY, then alloc_page either made some reclaim progress, or
waited for a while, before failing. So there is no need for much
further waiting. memalloc_retry_wait() will wait until the current
jiffy ends. If this condition is not met, then alloc_page() won't have
waited much if at all. In that case memalloc_retry_wait() waits about
200ms. This is the delay that most current loops use.
linux/sched/mm.h needs to be included in some files now,
but linux/backing-dev.h does not.
Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-14 14:07:14 -08:00
memalloc_retry_wait ( GFP_NOFS ) ;
2017-10-19 09:43:56 -07:00
goto repeat ;
}
f2fs: guarantee journalled quota data by checkpoint
For journalled quota mode, let checkpoint flush dquot dirty data
and quota file data to guarantee persistence of all quota sysfiles in
the last checkpoint. This way, we can avoid corrupting quota sysfiles
when encountering SPO (sudden power-off).
The implementation is as below:
1. add a global state SBI_QUOTA_NEED_FLUSH to indicate that there is
cached dquot metadata changes in quota subsystem, and later checkpoint
should:
a) flush dquot metadata into quota file.
b) flush quota file to storage to keep file usage consistent.
2. add a global state SBI_QUOTA_NEED_REPAIR to indicate that quota
operation failed due to -EIO or -ENOSPC, so later,
a) checkpoint will skip syncing dquot metadata.
b) CP_QUOTA_NEED_FSCK_FLAG will be set in last cp pack to give a
hint for fsck repairing.
3. add a global state SBI_QUOTA_SKIP_FLUSH: in checkpoint, if quota
data updating is very heavy, it may cause a hung task in block_operation().
To avoid this, if our retry count exceeds the threshold, let's just skip
flushing and retry in the next checkpoint().
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: avoid warnings and set fsck flag]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-20 20:05:00 +08:00
set_sbi_flag ( F2FS_SB ( sb ) , SBI_QUOTA_NEED_REPAIR ) ;
2017-07-09 00:13:07 +08:00
return PTR_ERR ( page ) ;
2017-10-19 09:43:56 -07:00
}
2017-07-09 00:13:07 +08:00
lock_page ( page ) ;
if ( unlikely ( page - > mapping ! = mapping ) ) {
f2fs_put_page ( page , 1 ) ;
goto repeat ;
}
if ( unlikely ( ! PageUptodate ( page ) ) ) {
f2fs_put_page ( page , 1 ) ;
2018-09-20 20:05:00 +08:00
set_sbi_flag ( F2FS_SB ( sb ) , SBI_QUOTA_NEED_REPAIR ) ;
2017-07-09 00:13:07 +08:00
return - EIO ;
}
2022-08-19 15:33:00 -07:00
memcpy_from_page ( data , page , offset , tocopy ) ;
2017-07-09 00:13:07 +08:00
f2fs_put_page ( page , 1 ) ;
offset = 0 ;
toread - = tocopy ;
data + = tocopy ;
blkidx + + ;
}
return len ;
}
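The read loop above clamps the request to the file size and then copies one block-sized piece at a time, with only the first piece starting at a non-zero offset inside its block. The sketch below reproduces just that arithmetic in userspace over a flat buffer. It is an illustration under stated assumptions, not the kernel path: sketch_chunked_read() and SKETCH_BLOCK_SIZE are hypothetical, and the page cache lookup, locking and -ENOMEM retry are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BLOCK_SIZE 8u	/* hypothetical block size; must be a power of two */

/*
 * Copy up to 'len' bytes starting at byte offset 'off' from 'file'
 * (which holds 'file_size' bytes) into 'data', one block at a time.
 * Returns the number of bytes copied.
 */
static size_t sketch_chunked_read(const char *file, size_t file_size,
				  char *data, size_t len, size_t off)
{
	size_t blkidx, offset, toread, tocopy;

	assert(file && data);

	if (off > file_size)
		return 0;
	if (off + len > file_size)
		len = file_size - off;		/* clamp to end of file */

	blkidx = off / SKETCH_BLOCK_SIZE;	/* first block touched */
	offset = off & (SKETCH_BLOCK_SIZE - 1);	/* offset inside that block */
	toread = len;

	while (toread > 0) {
		tocopy = SKETCH_BLOCK_SIZE - offset;
		if (tocopy > toread)
			tocopy = toread;
		memcpy(data, file + blkidx * SKETCH_BLOCK_SIZE + offset, tocopy);
		offset = 0;			/* later blocks start at byte 0 */
		toread -= tocopy;
		data += tocopy;
		blkidx++;
	}
	return len;
}
```

With an 8-byte block, a 10-byte read at offset 5 copies 3 bytes from block 0 and then 7 bytes from block 1.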
/* Write to quotafile */
static ssize_t f2fs_quota_write ( struct super_block * sb , int type ,
const char * data , size_t len , loff_t off )
{
struct inode * inode = sb_dqopt ( sb ) - > files [ type ] ;
struct address_space * mapping = inode - > i_mapping ;
const struct address_space_operations * a_ops = mapping - > a_ops ;
int offset = off & ( sb - > s_blocksize - 1 ) ;
size_t towrite = len ;
struct page * page ;
2020-03-19 19:58:00 +08:00
void * fsdata = NULL ;
2017-07-09 00:13:07 +08:00
int err = 0 ;
int tocopy ;
while ( towrite > 0 ) {
tocopy = min_t ( unsigned long , sb - > s_blocksize - offset ,
towrite ) ;
2017-10-19 09:43:56 -07:00
retry :
2022-02-22 14:31:43 -05:00
err = a_ops - > write_begin ( NULL , mapping , off , tocopy ,
2020-03-19 19:58:00 +08:00
& page , & fsdata ) ;
2017-10-19 09:43:56 -07:00
if ( unlikely ( err ) ) {
if ( err = = - ENOMEM ) {
2022-03-22 14:39:13 -07:00
f2fs_io_schedule_timeout ( DEFAULT_IO_TIMEOUT ) ;
2017-10-19 09:43:56 -07:00
goto retry ;
}
2018-09-20 20:05:00 +08:00
set_sbi_flag ( F2FS_SB ( sb ) , SBI_QUOTA_NEED_REPAIR ) ;
2017-07-09 00:13:07 +08:00
break ;
2017-10-19 09:43:56 -07:00
}
2017-07-09 00:13:07 +08:00
2022-08-19 15:33:00 -07:00
memcpy_to_page ( page , offset , data , tocopy ) ;
2017-07-09 00:13:07 +08:00
a_ops - > write_end ( NULL , mapping , off , tocopy , tocopy ,
2020-03-19 19:58:00 +08:00
page , fsdata ) ;
2017-07-09 00:13:07 +08:00
offset = 0 ;
towrite - = tocopy ;
off + = tocopy ;
data + = tocopy ;
cond_resched ( ) ;
}
if ( len = = towrite )
2017-10-19 12:07:11 -07:00
return err ;
2017-07-09 00:13:07 +08:00
inode - > i_mtime = inode - > i_ctime = current_time ( inode ) ;
f2fs_mark_inode_dirty_sync ( inode , false ) ;
return len - towrite ;
}
2021-10-28 21:03:05 +08:00
int f2fs_dquot_initialize ( struct inode * inode )
{
2022-12-21 02:39:04 +08:00
if ( time_to_inject ( F2FS_I_SB ( inode ) , FAULT_DQUOT_INIT ) )
2021-10-28 21:03:05 +08:00
return - ESRCH ;
return dquot_initialize ( inode ) ;
}
2017-07-09 00:13:07 +08:00
static struct dquot * * f2fs_get_dquots ( struct inode * inode )
{
return F2FS_I ( inode ) - > i_dquot ;
}
static qsize_t * f2fs_get_reserved_space ( struct inode * inode )
{
return & F2FS_I ( inode ) - > i_reserved_quota ;
}
2017-08-08 10:54:31 +08:00
static int f2fs_quota_on_mount ( struct f2fs_sb_info * sbi , int type )
{
2018-09-20 20:05:00 +08:00
if ( is_set_ckpt_flags ( sbi , CP_QUOTA_NEED_FSCK_FLAG ) ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " quota sysfile may be corrupted, skip loading it " ) ;
2018-09-20 20:05:00 +08:00
return 0 ;
}
2018-03-08 14:22:56 +08:00
return dquot_quota_on_mount ( sbi - > sb , F2FS_OPTION ( sbi ) . s_qf_names [ type ] ,
F2FS_OPTION ( sbi ) . s_jquota_fmt , type ) ;
2017-08-08 10:54:31 +08:00
}
2017-10-06 09:14:28 -07:00
int f2fs_enable_quota_files ( struct f2fs_sb_info * sbi , bool rdonly )
2017-08-08 10:54:31 +08:00
{
2017-10-06 09:14:28 -07:00
int enabled = 0 ;
int i , err ;
2018-10-24 18:34:26 +08:00
if ( f2fs_sb_has_quota_ino ( sbi ) & & rdonly ) {
2017-10-06 09:14:28 -07:00
err = f2fs_enable_quotas ( sbi - > sb ) ;
if ( err ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " Cannot turn on quota_ino: %d " , err ) ;
2017-10-06 09:14:28 -07:00
return 0 ;
}
return 1 ;
}
2017-08-08 10:54:31 +08:00
for ( i = 0 ; i < MAXQUOTAS ; i + + ) {
2018-03-08 14:22:56 +08:00
if ( F2FS_OPTION ( sbi ) . s_qf_names [ i ] ) {
2017-10-06 09:14:28 -07:00
err = f2fs_quota_on_mount ( sbi , i ) ;
if ( ! err ) {
enabled = 1 ;
continue ;
}
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " Cannot turn on quotas: %d on %d " ,
err , i ) ;
2017-08-08 10:54:31 +08:00
}
}
2017-10-06 09:14:28 -07:00
return enabled ;
}
static int f2fs_quota_enable ( struct super_block * sb , int type , int format_id ,
unsigned int flags )
{
struct inode * qf_inode ;
unsigned long qf_inum ;
int err ;
2018-10-24 18:34:26 +08:00
BUG_ON ( ! f2fs_sb_has_quota_ino ( F2FS_SB ( sb ) ) ) ;
2017-10-06 09:14:28 -07:00
qf_inum = f2fs_qf_ino ( sb , type ) ;
if ( ! qf_inum )
return - EPERM ;
qf_inode = f2fs_iget ( sb , qf_inum ) ;
if ( IS_ERR ( qf_inode ) ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( F2FS_SB ( sb ) , " Bad quota inode %u:%lu " , type , qf_inum ) ;
2017-10-06 09:14:28 -07:00
return PTR_ERR ( qf_inode ) ;
}
/* Don't account quota for quota files to avoid recursion */
qf_inode - > i_flags | = S_NOQUOTA ;
2019-11-01 18:55:38 +01:00
err = dquot_load_quota_inode ( qf_inode , type , format_id , flags ) ;
2017-10-06 09:14:28 -07:00
iput ( qf_inode ) ;
return err ;
}
static int f2fs_enable_quotas ( struct super_block * sb )
{
2019-06-18 17:48:42 +08:00
struct f2fs_sb_info * sbi = F2FS_SB ( sb ) ;
2017-10-06 09:14:28 -07:00
int type , err = 0 ;
unsigned long qf_inum ;
bool quota_mopt [ MAXQUOTAS ] = {
2019-06-18 17:48:42 +08:00
test_opt ( sbi , USRQUOTA ) ,
test_opt ( sbi , GRPQUOTA ) ,
test_opt ( sbi , PRJQUOTA ) ,
2017-10-06 09:14:28 -07:00
} ;
2018-09-20 20:05:00 +08:00
if ( is_set_ckpt_flags ( F2FS_SB ( sb ) , CP_QUOTA_NEED_FSCK_FLAG ) ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " quota file may be corrupted, skip loading it " ) ;
2018-09-20 20:05:00 +08:00
return 0 ;
}
sb_dqopt ( sb ) - > flags | = DQUOT_QUOTA_SYS_FILE ;
2017-10-06 09:14:28 -07:00
	for (type = 0; type < MAXQUOTAS; type++) {
		qf_inum = f2fs_qf_ino(sb, type);
		if (qf_inum) {
			err = f2fs_quota_enable(sb, type, QFMT_VFS_V1,
				DQUOT_USAGE_ENABLED |
				(quota_mopt[type] ? DQUOT_LIMITS_ENABLED : 0));
			if (err) {
2019-06-18 17:48:42 +08:00
				f2fs_err(sbi, "Failed to enable quota tracking (type=%d, err=%d). Please run fsck to fix.",
					 type, err);
2017-10-06 09:14:28 -07:00
for ( type - - ; type > = 0 ; type - - )
dquot_quota_off ( sb , type ) ;
2018-09-20 20:05:00 +08:00
set_sbi_flag ( F2FS_SB ( sb ) ,
SBI_QUOTA_NEED_REPAIR ) ;
2017-10-06 09:14:28 -07:00
return err ;
}
2017-08-08 10:54:31 +08:00
}
}
2017-10-06 09:14:28 -07:00
return 0 ;
2017-08-08 10:54:31 +08:00
}
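The unwind loop in the function above (`for (type--; type >= 0; type--)`) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `model_enable_all`, `MODEL_MAXQUOTAS`, and the `fail_at` parameter are hypothetical names invented for illustration, and a plain bool array stands in for the kernel's quota state. It shows that when enabling one type fails, every lower-numbered type is turned off again before the error is returned.

```c
#include <stdbool.h>

/* Assumption for this sketch: three quota types (user, group, project). */
#define MODEL_MAXQUOTAS 3

/*
 * Hypothetical model of the enable-with-rollback loop.
 * enabled[] records which types are on; fail_at is the type whose
 * enable step fails, or -1 if every step succeeds.
 */
static int model_enable_all(bool enabled[MODEL_MAXQUOTAS], int fail_at)
{
	int type;

	for (type = 0; type < MODEL_MAXQUOTAS; type++) {
		if (type == fail_at) {
			/* Turn off the lower-numbered types, newest first. */
			for (type--; type >= 0; type--)
				enabled[type] = false;
			return -1;
		}
		enabled[type] = true;
	}
	return 0;
}
```

Calling `model_enable_all(state, 2)` on an all-false array leaves the array all-false and returns -1, which mirrors the all-or-nothing behaviour of the loop above.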
2021-07-19 16:46:47 +08:00
static int f2fs_quota_sync_file(struct f2fs_sb_info *sbi, int type)
{
	struct quota_info *dqopt = sb_dqopt(sbi->sb);
	struct address_space *mapping = dqopt->files[type]->i_mapping;
	int ret = 0;

	ret = dquot_writeback_dquots(sbi->sb, type);
	if (ret)
		goto out;

	ret = filemap_fdatawrite(mapping);
	if (ret)
		goto out;

	/* if we are using journalled quota */
	if (is_journalled_quota(sbi))
		goto out;

	ret = filemap_fdatawait(mapping);

	truncate_inode_pages(&dqopt->files[type]->i_data, 0);
out:
	if (ret)
		set_sbi_flag(sbi, SBI_QUOTA_NEED_REPAIR);
	return ret;
}
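The control flow of f2fs_quota_sync_file() (a single exit label, an early exit for journalled quota, and a repair flag set on any failure) can be modelled in userspace. The following is a minimal sketch, not the kernel implementation: `struct model_sync` and `model_quota_sync_file` are hypothetical, and the three `*_ret` fields stand in for the return values of dquot_writeback_dquots(), filemap_fdatawrite(), and filemap_fdatawait().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the inputs and side effects of one sync pass. */
struct model_sync {
	int writeback_ret;	/* models dquot_writeback_dquots() result */
	int fdatawrite_ret;	/* models filemap_fdatawrite() result */
	int fdatawait_ret;	/* models filemap_fdatawait() result */
	bool journalled;	/* models is_journalled_quota() */
	bool need_repair;	/* output: models SBI_QUOTA_NEED_REPAIR */
	bool waited;		/* output: true if the wait step ran */
};

static int model_quota_sync_file(struct model_sync *m)
{
	int ret;

	m->waited = false;

	ret = m->writeback_ret;
	if (ret)
		goto out;

	ret = m->fdatawrite_ret;
	if (ret)
		goto out;

	/* Journalled quota skips the wait step in this model. */
	if (m->journalled)
		goto out;

	ret = m->fdatawait_ret;
	m->waited = true;
out:
	/* Any non-zero result marks the quota state as needing repair. */
	if (ret)
		m->need_repair = true;
	return ret;
}
```

In this model a journalled run never reaches the wait step, so a wait failure cannot be reported for it, while a failure at any step that does run sets the repair flag.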
2018-09-20 20:05:00 +08:00
int f2fs_quota_sync ( struct super_block * sb , int type )
2017-07-09 00:13:07 +08:00
{
2018-09-20 20:05:00 +08:00
struct f2fs_sb_info * sbi = F2FS_SB ( sb ) ;
2017-07-09 00:13:07 +08:00
struct quota_info * dqopt = sb_dqopt ( sb ) ;
int cnt ;
f2fs: quota: fix loop condition at f2fs_quota_sync()
cnt should be passed to sb_has_quota_active() instead of type to check
for active quota properly.
Moreover, when type is -1, a compiler with enough inlining knowledge
can discard the sb_has_quota_active() check altogether, causing a NULL pointer
dereference at the following inode_lock(dqopt->files[cnt]):
[ 2.796010] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000a0
[ 2.796024] Mem abort info:
[ 2.796025] ESR = 0x96000005
[ 2.796028] EC = 0x25: DABT (current EL), IL = 32 bits
[ 2.796029] SET = 0, FnV = 0
[ 2.796031] EA = 0, S1PTW = 0
[ 2.796032] Data abort info:
[ 2.796034] ISV = 0, ISS = 0x00000005
[ 2.796035] CM = 0, WnR = 0
[ 2.796046] user pgtable: 4k pages, 39-bit VAs, pgdp=00000003370d1000
[ 2.796048] [00000000000000a0] pgd=0000000000000000, pud=0000000000000000
[ 2.796051] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[ 2.796056] CPU: 7 PID: 640 Comm: f2fs_ckpt-259:7 Tainted: G S 5.4.179-arter97-r8-64666-g2f16e087f9d8 #1
[ 2.796057] Hardware name: Qualcomm Technologies, Inc. Lahaina MTP lemonadep (DT)
[ 2.796059] pstate: 80c00005 (Nzcv daif +PAN +UAO)
[ 2.796065] pc : down_write+0x28/0x70
[ 2.796070] lr : f2fs_quota_sync+0x100/0x294
[ 2.796071] sp : ffffffa3f48ffc30
[ 2.796073] x29: ffffffa3f48ffc30 x28: 0000000000000000
[ 2.796075] x27: ffffffa3f6d718b8 x26: ffffffa415fe9d80
[ 2.796077] x25: ffffffa3f7290048 x24: 0000000000000001
[ 2.796078] x23: 0000000000000000 x22: ffffffa3f7290000
[ 2.796080] x21: ffffffa3f72904a0 x20: ffffffa3f7290110
[ 2.796081] x19: ffffffa3f77a9800 x18: ffffffc020aae038
[ 2.796083] x17: ffffffa40e38e040 x16: ffffffa40e38e6d0
[ 2.796085] x15: ffffffa40e38e6cc x14: ffffffa40e38e6d0
[ 2.796086] x13: 00000000000004f6 x12: 00162c44ff493000
[ 2.796088] x11: 0000000000000400 x10: ffffffa40e38c948
[ 2.796090] x9 : 0000000000000000 x8 : 00000000000000a0
[ 2.796091] x7 : 0000000000000000 x6 : 0000d1060f00002a
[ 2.796093] x5 : ffffffa3f48ff718 x4 : 000000000000000d
[ 2.796094] x3 : 00000000060c0000 x2 : 0000000000000001
[ 2.796096] x1 : 0000000000000000 x0 : 00000000000000a0
[ 2.796098] Call trace:
[ 2.796100] down_write+0x28/0x70
[ 2.796102] f2fs_quota_sync+0x100/0x294
[ 2.796104] block_operations+0x120/0x204
[ 2.796106] f2fs_write_checkpoint+0x11c/0x520
[ 2.796107] __checkpoint_and_complete_reqs+0x7c/0xd34
[ 2.796109] issue_checkpoint_thread+0x6c/0xb8
[ 2.796112] kthread+0x138/0x414
[ 2.796114] ret_from_fork+0x10/0x18
[ 2.796117] Code: aa0803e0 aa1f03e1 52800022 aa0103e9 (c8e97d02)
[ 2.796120] ---[ end trace 96e942e8eb6a0b53 ]---
[ 2.800116] Kernel panic - not syncing: Fatal exception
[ 2.800120] SMP: stopping secondary CPUs
Fixes: 9de71ede81e6 ("f2fs: quota: fix potential deadlock")
Cc: <stable@vger.kernel.org> # v5.15+
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-02-15 17:27:21 +09:00
int ret = 0 ;
2017-07-09 00:13:07 +08:00
	/*
	 * Now when everything is written we can discard the pagecache so
	 * that userspace sees the changes.
	 */
	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
2018-09-20 20:05:00 +08:00
2017-07-09 00:13:07 +08:00
if ( type ! = - 1 & & cnt ! = type )
continue ;
2022-02-15 17:27:21 +09:00
if ( ! sb_has_quota_active ( sb , cnt ) )
continue ;
2018-09-20 20:05:00 +08:00
2022-05-05 17:40:25 -07:00
if ( ! f2fs_sb_has_quota_ino ( sbi ) )
inode_lock ( dqopt - > files [ cnt ] ) ;
2018-09-20 20:05:00 +08:00
2021-07-19 16:46:47 +08:00
/*
* do_quotactl
* f2fs_quota_sync
2022-01-07 12:48:44 -08:00
* f2fs_down_read ( quota_sem )
2021-07-19 16:46:47 +08:00
* dquot_writeback_dquots ( )
* f2fs_dquot_commit
* block_operation
2022-01-07 12:48:44 -08:00
* f2fs_down_read ( quota_sem )
2021-07-19 16:46:47 +08:00
*/
f2fs_lock_op ( sbi ) ;
2022-01-07 12:48:44 -08:00
f2fs_down_read ( & sbi - > quota_sem ) ;
2018-09-20 20:05:00 +08:00
2021-07-19 16:46:47 +08:00
ret = f2fs_quota_sync_file ( sbi , cnt ) ;
2022-01-07 12:48:44 -08:00
f2fs_up_read ( & sbi - > quota_sem ) ;
2021-07-19 16:46:47 +08:00
f2fs_unlock_op ( sbi ) ;
2017-07-09 00:13:07 +08:00
2022-05-05 17:40:25 -07:00
if ( ! f2fs_sb_has_quota_ino ( sbi ) )
inode_unlock ( dqopt - > files [ cnt ] ) ;
2021-07-19 16:46:47 +08:00
if ( ret )
break ;
2017-07-09 00:13:07 +08:00
}
2018-09-20 20:05:00 +08:00
return ret ;
2017-07-09 00:13:07 +08:00
}
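The type selection in f2fs_quota_sync() can be sketched as a userspace model: type -1 selects every quota type, and the activity check must use the loop index rather than the requested type. The names `model_quota_sync_mask` and `MODEL_MAXQUOTAS` are hypothetical, and a bool array stands in for sb_has_quota_active(). The function returns a bitmask of the types that the loop would sync.

```c
#include <stdbool.h>

/* Assumption for this sketch: three quota types (user, group, project). */
#define MODEL_MAXQUOTAS 3

/*
 * Hypothetical model of the selection loop.
 * active[] stands in for sb_has_quota_active(sb, cnt).
 * Returns a bitmask with bit cnt set for every type that would be synced.
 */
static unsigned int model_quota_sync_mask(const bool active[MODEL_MAXQUOTAS],
					  int type)
{
	unsigned int mask = 0;
	int cnt;

	for (cnt = 0; cnt < MODEL_MAXQUOTAS; cnt++) {
		/* type == -1 means "all types"; otherwise match one type. */
		if (type != -1 && cnt != type)
			continue;

		/* Check the loop index, not the requested type. */
		if (!active[cnt])
			continue;

		mask |= 1u << cnt;
	}
	return mask;
}
```

With only types 0 and 2 active, a request for type -1 selects exactly those two. Indexing the activity check by `type` instead of `cnt` would read outside the array for -1, which corresponds to the class of bug the loop-condition fix above addresses.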
static int f2fs_quota_on(struct super_block *sb, int type, int format_id,
							const struct path *path)
{
	struct inode *inode;
	int err;
2019-07-25 17:33:37 +08:00
	/* if quota sysfile exists, deny enabling quota with specific file */
	if (f2fs_sb_has_quota_ino(F2FS_SB(sb))) {
		f2fs_err(F2FS_SB(sb), "quota sysfile already exists");
		return -EBUSY;
	}
2017-08-07 16:37:59 +08:00
err = f2fs_quota_sync ( sb , type ) ;
2017-07-09 00:13:07 +08:00
	if (err)
		return err;

	err = dquot_quota_on(sb, type, format_id, path);
	if (err)
		return err;

	inode = d_inode(path->dentry);

	inode_lock(inode);
2018-04-03 15:08:17 +08:00
F2FS_I ( inode ) - > i_flags | = F2FS_NOATIME_FL | F2FS_IMMUTABLE_FL ;
2018-10-07 19:06:15 +08:00
f2fs_set_inode_flags ( inode ) ;
2017-07-09 00:13:07 +08:00
	inode_unlock(inode);
	f2fs_mark_inode_dirty_sync(inode, false);

	return 0;
}
2019-07-25 17:33:37 +08:00
static int __f2fs_quota_off ( struct super_block * sb , int type )
2017-07-09 00:13:07 +08:00
{
	struct inode *inode = sb_dqopt(sb)->files[type];
	int err;

	if (!inode || !igrab(inode))
		return dquot_quota_off(sb, type);
2018-06-26 13:12:43 +08:00
err = f2fs_quota_sync ( sb , type ) ;
if ( err )
goto out_put ;
2017-07-09 00:13:07 +08:00
err = dquot_quota_off ( sb , type ) ;
2018-10-24 18:34:26 +08:00
if ( err | | f2fs_sb_has_quota_ino ( F2FS_SB ( sb ) ) )
2017-07-09 00:13:07 +08:00
goto out_put ;
inode_lock ( inode ) ;
2018-04-03 15:08:17 +08:00
F2FS_I ( inode ) - > i_flags & = ~ ( F2FS_NOATIME_FL | F2FS_IMMUTABLE_FL ) ;
2018-10-07 19:06:15 +08:00
f2fs_set_inode_flags ( inode ) ;
2017-07-09 00:13:07 +08:00
	inode_unlock(inode);
	f2fs_mark_inode_dirty_sync(inode, false);
out_put:
	iput(inode);
	return err;
}
2019-07-25 17:33:37 +08:00
static int f2fs_quota_off(struct super_block *sb, int type)
{
	struct f2fs_sb_info *sbi = F2FS_SB(sb);
	int err;

	err = __f2fs_quota_off(sb, type);

	/*
	 * quotactl can shut down journalled quota, which results in
	 * inconsistency between the quota record and fs data after later
	 * updates; set the flag to let fsck be aware of it.
	 */
	if (is_journalled_quota(sbi))
		set_sbi_flag(sbi, SBI_QUOTA_NEED_REPAIR);
	return err;
}
2017-08-08 10:54:31 +08:00
void f2fs_quota_off_umount ( struct super_block * sb )
2017-07-09 00:13:07 +08:00
{
int type ;
2018-06-26 13:12:43 +08:00
int err ;
for ( type = 0 ; type < MAXQUOTAS ; type + + ) {
2019-07-25 17:33:37 +08:00
err = __f2fs_quota_off ( sb , type ) ;
2018-06-26 13:12:43 +08:00
if ( err ) {
int ret = dquot_quota_off ( sb , type ) ;
2017-07-09 00:13:07 +08:00
2019-06-18 17:48:42 +08:00
			f2fs_err(F2FS_SB(sb), "Fail to turn off disk quota (type: %d, err: %d, ret:%d), Please run fsck to fix it.",
				 type, err, ret);
2018-09-20 20:05:00 +08:00
set_sbi_flag ( F2FS_SB ( sb ) , SBI_QUOTA_NEED_REPAIR ) ;
2018-06-26 13:12:43 +08:00
}
}
2019-01-27 17:59:53 -08:00
	/*
	 * In case of checkpoint=disable, we must flush quota blocks.
	 * This can cause NULL exception for node_inode in end_io, since
	 * put_super already dropped it.
	 */
	sync_filesystem(sb);
2017-07-09 00:13:07 +08:00
}
2018-10-12 18:49:26 +08:00
static void f2fs_truncate_quota_inode_pages(struct super_block *sb)
{
	struct quota_info *dqopt = sb_dqopt(sb);
	int type;

	for (type = 0; type < MAXQUOTAS; type++) {
		if (!dqopt->files[type])
			continue;
		f2fs_inode_synced(dqopt->files[type]);
	}
}
2018-09-20 20:05:00 +08:00
static int f2fs_dquot_commit ( struct dquot * dquot )
{
2019-05-29 10:58:45 -07:00
struct f2fs_sb_info * sbi = F2FS_SB ( dquot - > dq_sb ) ;
2018-09-20 20:05:00 +08:00
int ret ;
2022-01-07 12:48:44 -08:00
f2fs_down_read_nested ( & sbi - > quota_sem , SINGLE_DEPTH_NESTING ) ;
2018-09-20 20:05:00 +08:00
ret = dquot_commit ( dquot ) ;
if ( ret < 0 )
2019-05-29 10:58:45 -07:00
set_sbi_flag ( sbi , SBI_QUOTA_NEED_REPAIR ) ;
2022-01-07 12:48:44 -08:00
f2fs_up_read ( & sbi - > quota_sem ) ;
2018-09-20 20:05:00 +08:00
return ret ;
}
static int f2fs_dquot_acquire ( struct dquot * dquot )
{
2019-05-29 10:58:45 -07:00
struct f2fs_sb_info * sbi = F2FS_SB ( dquot - > dq_sb ) ;
2018-09-20 20:05:00 +08:00
int ret ;
2022-01-07 12:48:44 -08:00
f2fs_down_read ( & sbi - > quota_sem ) ;
2018-09-20 20:05:00 +08:00
	ret = dquot_acquire(dquot);
	if (ret < 0)
2019-05-29 10:58:45 -07:00
		set_sbi_flag(sbi, SBI_QUOTA_NEED_REPAIR);
2022-01-07 12:48:44 -08:00
	f2fs_up_read(&sbi->quota_sem);
2018-09-20 20:05:00 +08:00
	return ret;
}
static int f2fs_dquot_release(struct dquot *dquot)
{
2019-05-29 10:58:45 -07:00
	struct f2fs_sb_info *sbi = F2FS_SB(dquot->dq_sb);
2019-12-03 17:31:00 -08:00
	int ret = dquot_release(dquot);
2018-09-20 20:05:00 +08:00
	if (ret < 0)
2019-05-29 10:58:45 -07:00
		set_sbi_flag(sbi, SBI_QUOTA_NEED_REPAIR);
2018-09-20 20:05:00 +08:00
	return ret;
}
static int f2fs_dquot_mark_dquot_dirty(struct dquot *dquot)
{
	struct super_block *sb = dquot->dq_sb;
	struct f2fs_sb_info *sbi = F2FS_SB(sb);
2019-12-03 17:31:00 -08:00
	int ret = dquot_mark_dquot_dirty(dquot);
2018-09-20 20:05:00 +08:00
	/* if we are using journalled quota */
	if (is_journalled_quota(sbi))
		set_sbi_flag(sbi, SBI_QUOTA_NEED_FLUSH);
	return ret;
}
static int f2fs_dquot_commit_info(struct super_block *sb, int type)
{
2019-05-29 10:58:45 -07:00
	struct f2fs_sb_info *sbi = F2FS_SB(sb);
2019-12-03 17:31:00 -08:00
	int ret = dquot_commit_info(sb, type);
2018-09-20 20:05:00 +08:00
	if (ret < 0)
2019-05-29 10:58:45 -07:00
		set_sbi_flag(sbi, SBI_QUOTA_NEED_REPAIR);
2018-09-20 20:05:00 +08:00
	return ret;
}
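The quota wrappers above all follow one pattern: a failed quota operation marks the filesystem as needing repair, and a dirtied dquot under journalled quota marks it as needing a flush. A minimal userspace sketch of that flag handling follows. Every `sketch_*` name, the flag values, and the checkpoint-side helper are hypothetical stand-ins for illustration, not the kernel's definitions; the checkpoint decision only paraphrases the commit message ("checkpoint will skip syncing dquot metadata").

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the SBI_QUOTA_* bits; values are illustrative. */
enum sketch_quota_flag {
	SKETCH_QUOTA_NEED_FLUSH  = 1u << 0,
	SKETCH_QUOTA_NEED_REPAIR = 1u << 1,
};

/* Hypothetical stand-in for the relevant part of struct f2fs_sb_info. */
struct sketch_sbi {
	unsigned int flags;
	bool journalled_quota;
};

/* Pattern of the acquire/release/commit_info wrappers: flag any failure. */
int sketch_quota_op_done(struct sketch_sbi *sbi, int ret)
{
	assert(sbi != NULL);
	if (ret < 0)
		sbi->flags |= SKETCH_QUOTA_NEED_REPAIR;
	return ret;
}

/* Pattern of the mark_dirty wrapper: request a flush under journalled quota. */
int sketch_mark_dirty(struct sketch_sbi *sbi, int ret)
{
	assert(sbi != NULL);
	if (sbi->journalled_quota)
		sbi->flags |= SKETCH_QUOTA_NEED_FLUSH;
	return ret;
}

/* Assumed checkpoint-side decision: sync dquots only if flush is pending and no repair is needed. */
bool sketch_checkpoint_should_sync_dquots(const struct sketch_sbi *sbi)
{
	assert(sbi != NULL);
	if (sbi->flags & SKETCH_QUOTA_NEED_REPAIR)
		return false;
	return (sbi->flags & SKETCH_QUOTA_NEED_FLUSH) != 0;
}
```

Passing the operation's return value through unchanged keeps the wrappers transparent to the generic quota code; only the side effect on the flags differs.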
2018-10-12 18:49:26 +08:00
2018-01-05 09:41:20 +00:00
static int f2fs_get_projid(struct inode *inode, kprojid_t *projid)
2017-07-26 00:01:41 +08:00
{
	*projid = F2FS_I(inode)->i_projid;
	return 0;
}
2017-07-09 00:13:07 +08:00
static const struct dquot_operations f2fs_quota_operations = {
	.get_reserved_space = f2fs_get_reserved_space,
2018-09-20 20:05:00 +08:00
	.write_dquot = f2fs_dquot_commit,
	.acquire_dquot = f2fs_dquot_acquire,
	.release_dquot = f2fs_dquot_release,
	.mark_dirty = f2fs_dquot_mark_dquot_dirty,
	.write_info = f2fs_dquot_commit_info,
2017-07-09 00:13:07 +08:00
	.alloc_dquot = dquot_alloc,
	.destroy_dquot = dquot_destroy,
2017-07-26 00:01:41 +08:00
	.get_projid = f2fs_get_projid,
2017-07-09 00:13:07 +08:00
	.get_next_id = dquot_get_next_id,
};
static const struct quotactl_ops f2fs_quotactl_ops = {
	.quota_on = f2fs_quota_on,
	.quota_off = f2fs_quota_off,
	.quota_sync = f2fs_quota_sync,
	.get_state = dquot_get_state,
	.set_info = dquot_set_dqinfo,
	.get_dqblk = dquot_get_dqblk,
	.set_dqblk = dquot_set_dqblk,
	.get_nextdqblk = dquot_get_next_dqblk,
};
#else
2021-10-28 21:03:05 +08:00
int f2fs_dquot_initialize(struct inode *inode)
{
	return 0;
}
2018-09-20 20:05:00 +08:00
int f2fs_quota_sync(struct super_block *sb, int type)
{
	return 0;
}
2017-08-08 10:54:31 +08:00
void f2fs_quota_off_umount(struct super_block *sb)
2017-07-09 00:13:07 +08:00
{
}
#endif
2017-08-31 15:06:24 +05:30
static const struct super_operations f2fs_sops = {
2012-11-02 17:07:47 +09:00
	.alloc_inode = f2fs_alloc_inode,
2019-04-15 19:29:14 -04:00
	.free_inode = f2fs_free_inode,
2013-04-30 11:33:27 +09:00
	.drop_inode = f2fs_drop_inode,
2012-11-02 17:07:47 +09:00
	.write_inode = f2fs_write_inode,
2013-06-10 09:17:01 +09:00
	.dirty_inode = f2fs_dirty_inode,
2012-11-02 17:07:47 +09:00
	.show_options = f2fs_show_options,
2017-07-09 00:13:07 +08:00
#ifdef CONFIG_QUOTA
	.quota_read = f2fs_quota_read,
	.quota_write = f2fs_quota_write,
	.get_dquots = f2fs_get_dquots,
#endif
2012-11-02 17:07:47 +09:00
	.evict_inode = f2fs_evict_inode,
	.put_super = f2fs_put_super,
	.sync_fs = f2fs_sync_fs,
2013-01-29 18:30:07 +09:00
	.freeze_fs = f2fs_freeze,
	.unfreeze_fs = f2fs_unfreeze,
2012-11-02 17:07:47 +09:00
	.statfs = f2fs_statfs,
2013-06-16 09:48:48 +09:00
	.remount_fs = f2fs_remount,
2012-11-02 17:07:47 +09:00
};
2018-12-12 15:20:12 +05:30
#ifdef CONFIG_FS_ENCRYPTION
2015-05-15 16:26:10 -07:00
static int f2fs_get_context(struct inode *inode, void *ctx, size_t len)
{
	return f2fs_getxattr(inode, F2FS_XATTR_INDEX_ENCRYPTION,
				F2FS_XATTR_NAME_ENCRYPTION_CONTEXT,
				ctx, len, NULL);
}
static int f2fs_set_context(struct inode *inode, const void *ctx, size_t len,
							void *fs_data)
{
2018-03-15 18:51:41 +08:00
	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
	/*
	 * Encrypting the root directory is not allowed because fsck
	 * expects lost+found directory to exist and remain unencrypted
	 * if LOST_FOUND feature is enabled.
	 *
	 */
2018-10-24 18:34:26 +08:00
	if (f2fs_sb_has_lost_found(sbi) &&
2018-03-15 18:51:41 +08:00
			inode->i_ino == F2FS_ROOT_INO(sbi))
		return -EPERM;
2015-05-15 16:26:10 -07:00
	return f2fs_setxattr(inode, F2FS_XATTR_INDEX_ENCRYPTION,
				F2FS_XATTR_NAME_ENCRYPTION_CONTEXT,
				ctx, len, fs_data, XATTR_CREATE);
}
fscrypt: handle test_dummy_encryption in more logical way
The behavior of the test_dummy_encryption mount option is that when a
new file (or directory or symlink) is created in an unencrypted
directory, it's automatically encrypted using a dummy encryption policy.
That's it; in particular, the encryption (or lack thereof) of existing
files (or directories or symlinks) doesn't change.
Unfortunately the implementation of test_dummy_encryption is a bit weird
and confusing. When test_dummy_encryption is enabled and a file is
being created in an unencrypted directory, we set up an encryption key
(->i_crypt_info) for the directory. This isn't actually used to do any
encryption, however, since the directory is still unencrypted! Instead,
->i_crypt_info is only used for inheriting the encryption policy.
One consequence of this is that the filesystem ends up providing a
"dummy context" (policy + nonce) instead of a "dummy policy". In
commit ed318a6cc0b6 ("fscrypt: support test_dummy_encryption=v2"), I
mistakenly thought this was required. However, actually the nonce only
ends up being used to derive a key that is never used.
Another consequence of this implementation is that it allows for
'inode->i_crypt_info != NULL && !IS_ENCRYPTED(inode)', which is an edge
case that can be forgotten about. For example, currently
FS_IOC_GET_ENCRYPTION_POLICY on an unencrypted directory may return the
dummy encryption policy when the filesystem is mounted with
test_dummy_encryption. That seems like the wrong thing to do, since
again, the directory itself is not actually encrypted.
Therefore, switch to a more logical and maintainable implementation
where the dummy encryption policy inheritance is done without setting up
keys for unencrypted directories. This involves:
- Adding a function fscrypt_policy_to_inherit() which returns the
encryption policy to inherit from a directory. This can be a real
policy, a dummy policy, or no policy.
- Replacing struct fscrypt_dummy_context, ->get_dummy_context(), etc.
with struct fscrypt_dummy_policy, ->get_dummy_policy(), etc.
- Making fscrypt_fname_encrypted_size() take an fscrypt_policy instead
of an inode.
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-13-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-16 21:11:35 -07:00
static const union fscrypt_policy *f2fs_get_dummy_policy(struct super_block *sb)
2018-03-15 18:51:42 +08:00
{
2020-09-16 21:11:35 -07:00
	return F2FS_OPTION(F2FS_SB(sb)).dummy_enc_policy.policy;
2018-03-15 18:51:42 +08:00
}
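The fscrypt commit message describes a function that picks the policy a new file inherits: the directory's real policy, the dummy policy, or none. The sketch below models only that three-way choice in userspace C. The `sketch_*` types and function are hypothetical and do not reproduce the real fscrypt_policy_to_inherit() signature or its error handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an encryption policy. */
struct sketch_policy {
	int version;
};

/* Hypothetical, simplified stand-in for a parent directory inode. */
struct sketch_dir {
	bool encrypted;
	const struct sketch_policy *own_policy; /* meaningful only if encrypted */
};

/*
 * Returns the policy a newly created file would inherit:
 *  - the directory's own policy if the directory is encrypted,
 *  - otherwise the filesystem's dummy policy, which is NULL when
 *    test_dummy_encryption is not in effect (meaning "no policy").
 */
const struct sketch_policy *
sketch_policy_to_inherit(const struct sketch_dir *dir,
			 const struct sketch_policy *dummy_policy)
{
	assert(dir != NULL);
	if (dir->encrypted)
		return dir->own_policy;
	return dummy_policy;
}
```

Because the unencrypted directory never gets a key of its own in this model, the "key set up but not encrypted" edge case from the commit message cannot arise.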
2019-10-24 14:54:38 -07:00
static bool f2fs_has_stable_inodes(struct super_block *sb)
{
	return true;
}
static void f2fs_get_ino_and_lblk_bits(struct super_block *sb,
				       int *ino_bits_ret, int *lblk_bits_ret)
{
	*ino_bits_ret = 8 * sizeof(nid_t);
	*lblk_bits_ret = 8 * sizeof(block_t);
}
2022-09-01 12:32:08 -07:00
static struct block_device **f2fs_get_devices(struct super_block *sb,
					      unsigned int *num_devs)
2020-07-02 01:56:06 +00:00
{
	struct f2fs_sb_info *sbi = F2FS_SB(sb);
2022-09-01 12:32:08 -07:00
	struct block_device **devs;
	int i;
2020-07-02 01:56:06 +00:00
2022-09-01 12:32:08 -07:00
	if (!f2fs_is_multi_device(sbi))
		return NULL;
2020-07-02 01:56:06 +00:00
2022-09-01 12:32:08 -07:00
	devs = kmalloc_array(sbi->s_ndevs, sizeof(*devs), GFP_KERNEL);
	if (!devs)
		return ERR_PTR(-ENOMEM);
2020-07-02 01:56:06 +00:00
	for (i = 0; i < sbi->s_ndevs; i++)
2022-09-01 12:32:08 -07:00
		devs[i] = FDEV(i).bdev;
	*num_devs = sbi->s_ndevs;
	return devs;
2020-07-02 01:56:06 +00:00
}
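f2fs_get_devices() hands fscrypt a freshly allocated array of device pointers, or nothing at all for a single-device filesystem. A userspace sketch of the same shape follows, with calloc() in place of kmalloc_array() and an integer return code in place of ERR_PTR(). The `sketch_*` names are hypothetical, and "multi-device" is assumed here to mean more than one device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct block_device. */
struct sketch_dev {
	int id;
};

/*
 * Builds an array of pointers into `pool`.
 * Returns 0 on success and -1 on allocation failure.
 * For a single-device setup *devs_out is set to NULL and *num_devs is
 * left untouched, mirroring the early "return NULL" above.
 * On success with several devices the caller owns *devs_out and must free() it.
 */
int sketch_get_devices(struct sketch_dev *pool, unsigned int ndevs,
		       struct sketch_dev ***devs_out, unsigned int *num_devs)
{
	struct sketch_dev **devs;
	unsigned int i;

	assert(pool != NULL && devs_out != NULL && num_devs != NULL);

	*devs_out = NULL;
	if (ndevs <= 1)
		return 0;

	devs = calloc(ndevs, sizeof(*devs));
	if (!devs)
		return -1;

	for (i = 0; i < ndevs; i++)
		devs[i] = &pool[i];
	*num_devs = ndevs;
	*devs_out = devs;
	return 0;
}
```

The array holds borrowed pointers only, so freeing it does not release the devices themselves.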
2017-02-07 12:42:10 -08:00
static const struct fscrypt_operations f2fs_cryptops = {
2019-10-24 14:54:38 -07:00
	.key_prefix = "f2fs:",
	.get_context = f2fs_get_context,
	.set_context = f2fs_set_context,
2020-09-16 21:11:35 -07:00
	.get_dummy_policy = f2fs_get_dummy_policy,
2019-10-24 14:54:38 -07:00
	.empty_dir = f2fs_empty_dir,
	.has_stable_inodes = f2fs_has_stable_inodes,
	.get_ino_and_lblk_bits = f2fs_get_ino_and_lblk_bits,
2020-07-02 01:56:06 +00:00
	.get_devices = f2fs_get_devices,
2015-05-15 16:26:10 -07:00
};
#endif
2012-11-02 17:07:47 +09:00
static struct inode *f2fs_nfs_get_inode(struct super_block *sb,
		u64 ino, u32 generation)
{
	struct f2fs_sb_info *sbi = F2FS_SB(sb);
	struct inode *inode;
f2fs: clean up symbol namespace
As Ted reported:
"Hi, I was looking at f2fs's sources recently, and I noticed that there
is a very large number of non-static symbols which don't have a f2fs
prefix. There's well over a hundred (see attached below).
As one example, in fs/f2fs/dir.c there is:
unsigned char get_de_type(struct f2fs_dir_entry *de)
This function is clearly only useful for f2fs, but it has a generic
name. This means that if any other file system tries to have the same
symbol name, there will be a symbol conflict and the kernel would not
successfully build. It also means that when someone is looking f2fs
sources, it's not at all obvious whether a function such as
read_data_page(), invalidate_blocks(), is a generic kernel function
found in the fs, mm, or block layers, or a f2fs specific function.
You might want to fix this at some point. Hopefully Kent's bcachefs
isn't similarly using genericly named functions, since that might
cause conflicts with f2fs's functions --- but just as this would be a
problem that we would rightly insist that Kent fix, this is something
that we should have rightly insisted that f2fs should have fixed
before it was integrated into the mainline kernel.
acquire_orphan_inode
add_ino_entry
add_orphan_inode
allocate_data_block
allocate_new_segments
alloc_nid
alloc_nid_done
alloc_nid_failed
available_free_memory
...."
This patch adds "f2fs_" prefix for all non-static symbols in order to:
a) avoid conflict with other kernel generic symbols;
b) to indicate the function is f2fs specific one instead of generic
one;
Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 00:20:41 +08:00
	if (f2fs_check_nid_range(sbi, ino))
2014-03-12 17:08:36 +08:00
		return ERR_PTR(-ESTALE);
2012-11-02 17:07:47 +09:00
	/*
	 * f2fs_iget isn't quite right if the inode is currently unallocated!
	 * However f2fs_iget currently does appropriate checks to handle stale
	 * inodes so everything is OK.
	 */
	inode = f2fs_iget(sb, ino);
	if (IS_ERR(inode))
		return ERR_CAST(inode);
2013-12-06 15:00:58 +09:00
	if (unlikely(generation && inode->i_generation != generation)) {
2012-11-02 17:07:47 +09:00
		/* we didn't find the right inode.. */
		iput(inode);
		return ERR_PTR(-ESTALE);
	}
	return inode;
}
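The generation test in f2fs_nfs_get_inode() treats a requested generation of 0 as "match anything" and otherwise requires an exact match, so a handle for a reused inode number is rejected as stale. A small userspace sketch of just that predicate, with a hypothetical helper name:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when a file handle's generation is acceptable
 * for the inode that was looked up. A requested generation of 0 acts as a
 * wildcard, matching the `generation && ...` condition in the function above.
 */
bool sketch_generation_ok(uint32_t requested, uint32_t inode_generation)
{
	if (requested == 0)
		return true;
	return requested == inode_generation;
}
```

When this predicate is false, the kernel function drops its inode reference with iput() and reports -ESTALE.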
static struct dentry *f2fs_fh_to_dentry(struct super_block *sb, struct fid *fid,
		int fh_len, int fh_type)
{
	return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
				    f2fs_nfs_get_inode);
}
static struct dentry *f2fs_fh_to_parent(struct super_block *sb, struct fid *fid,
		int fh_len, int fh_type)
{
	return generic_fh_to_parent(sb, fid, fh_len, fh_type,
				    f2fs_nfs_get_inode);
}
static const struct export_operations f2fs_export_ops = {
	.fh_to_dentry = f2fs_fh_to_dentry,
	.fh_to_parent = f2fs_fh_to_parent,
	.get_parent = f2fs_get_parent,
};
2021-01-13 13:21:54 +08:00
loff_t max_file_blocks(struct inode *inode)
2012-11-02 17:07:47 +09:00
{
2017-07-19 00:19:06 +08:00
	loff_t result = 0;
2021-01-13 13:21:54 +08:00
	loff_t leaf_count;
2012-11-02 17:07:47 +09:00
2017-07-19 00:19:06 +08:00
	/*
	 * note: previously, result is equal to (DEF_ADDRS_PER_INODE -
f2fs: support flexible inline xattr size
Now, in products, more and more features based on file encryption are
being introduced, and their demand for xattr space is increasing. However,
inline xattr has a fixed size of 200 bytes; once inline xattr space is full,
newly added xattr data occupies an additional xattr block, which may bring
more space usage and a performance regression during persisting.
In order to resolve the above issue, it's better to expand inline xattr size
flexibly according to the user's requirement.
So this patch introduces a new filesystem feature 'flexible inline xattr',
and a new mount option 'inline_xattr_size=%u'; once mkfs enables the
feature, we can use the option to make f2fs support flexible inline
xattr size.
To support this feature, we add an extra attribute i_inline_xattr_size in
the inode layout, indicating how much space inline xattr borrows from
the block address mapping space in the inode layout; by this, we can easily
locate and store flexible-sized inline xattr data in the inode.
Inode disk layout:
+----------------------+
| .i_mode |
| ... |
| .i_ext |
+----------------------+
| .i_extra_isize |
| .i_inline_xattr_size |-----------+
| ... | |
+----------------------+ |
| .i_addr | |
| - block address or | |
| - inline data | |
+----------------------+<---+ v
| inline xattr | +---inline xattr range
+----------------------+<---+
| .i_nid |
+----------------------+
| node_footer |
| (nid, ino, offset) |
+----------------------+
Note that we have to consider backward compatibility, which reserved
inline_data space, 200 bytes, all the time, as reported by Sheng Yong.
Previous inline data or directory always reserved 200 bytes in inode layout,
even if inline_xattr is disabled. In order to keep inline_dentry's structure
for backward compatibility, we get the space back only from inline_data.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reported-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-06 21:59:50 +08:00
	 * DEFAULT_INLINE_XATTR_ADDRS), but now f2fs try to reserve more
2017-07-19 00:19:06 +08:00
	 * space in inode.i_addr, it will be more safe to reassign
	 * result as zero.
	 */
2021-01-13 13:21:54 +08:00
	if (inode && f2fs_compressed_file(inode))
		leaf_count = ADDRS_PER_BLOCK(inode);
	else
		leaf_count = DEF_ADDRS_PER_BLOCK;
2012-11-02 17:07:47 +09:00
	/* two direct node blocks */
	result += (leaf_count * 2);
	/* two indirect node blocks */
	leaf_count *= NIDS_PER_BLOCK;
	result += (leaf_count * 2);
	/* one double indirect node block */
	leaf_count *= NIDS_PER_BLOCK;
	result += leaf_count;
	return result;
}
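The summation in max_file_blocks() can be checked by hand with a userspace sketch. The two constants below are assumptions for a 4 KiB block (1018 addresses per direct node block, 1018 node IDs per indirect node block); the authoritative values come from the f2fs headers, and `sketch_max_file_blocks` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for a 4 KiB block; the real constants live in the f2fs headers. */
#define SKETCH_ADDRS_PER_BLOCK 1018
#define SKETCH_NIDS_PER_BLOCK  1018

/*
 * Same arithmetic as max_file_blocks(): two direct node blocks, two
 * indirect node blocks, and one double indirect node block.
 */
int64_t sketch_max_file_blocks(int64_t addrs_per_block, int64_t nids_per_block)
{
	int64_t leaf_count = addrs_per_block;
	int64_t result = 0;

	assert(addrs_per_block > 0 && nids_per_block > 0);

	/* two direct node blocks */
	result += leaf_count * 2;
	/* two indirect node blocks */
	leaf_count *= nids_per_block;
	result += leaf_count * 2;
	/* one double indirect node block */
	leaf_count *= nids_per_block;
	result += leaf_count;
	return result;
}
```

Under those assumed constants, `sketch_max_file_blocks(SKETCH_ADDRS_PER_BLOCK, SKETCH_NIDS_PER_BLOCK)` evaluates to 2·1018 + 2·1018² + 1018³ = 1,057,052,516 blocks, roughly 3.94 TiB at 4 KiB per block.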
2016-03-20 15:33:20 -07:00
static int __f2fs_commit_super(struct buffer_head *bh,
			struct f2fs_super_block *super)
{
	lock_buffer(bh);
	if (super)
		memcpy(bh->b_data + F2FS_SUPER_OFFSET, super, sizeof(*super));
	set_buffer_dirty(bh);
	unlock_buffer(bh);
	/* it's rare case, we can do fua all the time */
2017-05-02 17:03:47 +02:00
	return __sync_dirty_buffer(bh, REQ_SYNC | REQ_PREFLUSH | REQ_FUA);
2016-03-20 15:33:20 -07:00
}
2016-03-23 17:05:27 -07:00
static inline bool sanity_check_area_boundary(struct f2fs_sb_info *sbi,
2016-03-20 15:33:20 -07:00
					struct buffer_head *bh)
2015-12-15 09:58:18 +08:00
{
2016-03-20 15:33:20 -07:00
	struct f2fs_super_block *raw_super = (struct f2fs_super_block *)
					(bh->b_data + F2FS_SUPER_OFFSET);
2016-03-23 17:05:27 -07:00
	struct super_block *sb = sbi->sb;
2015-12-15 09:58:18 +08:00
	u32 segment0_blkaddr = le32_to_cpu(raw_super->segment0_blkaddr);
	u32 cp_blkaddr = le32_to_cpu(raw_super->cp_blkaddr);
	u32 sit_blkaddr = le32_to_cpu(raw_super->sit_blkaddr);
	u32 nat_blkaddr = le32_to_cpu(raw_super->nat_blkaddr);
	u32 ssa_blkaddr = le32_to_cpu(raw_super->ssa_blkaddr);
	u32 main_blkaddr = le32_to_cpu(raw_super->main_blkaddr);
	u32 segment_count_ckpt = le32_to_cpu(raw_super->segment_count_ckpt);
	u32 segment_count_sit = le32_to_cpu(raw_super->segment_count_sit);
	u32 segment_count_nat = le32_to_cpu(raw_super->segment_count_nat);
	u32 segment_count_ssa = le32_to_cpu(raw_super->segment_count_ssa);
	u32 segment_count_main = le32_to_cpu(raw_super->segment_count_main);
	u32 segment_count = le32_to_cpu(raw_super->segment_count);
	u32 log_blocks_per_seg = le32_to_cpu(raw_super->log_blocks_per_seg);
2016-03-20 15:33:20 -07:00
	u64 main_end_blkaddr = main_blkaddr +
				(segment_count_main << log_blocks_per_seg);
	u64 seg_end_blkaddr = segment0_blkaddr +
				(segment_count << log_blocks_per_seg);
2015-12-15 09:58:18 +08:00
	if (segment0_blkaddr != cp_blkaddr) {
2019-06-18 17:48:42 +08:00
		f2fs_info(sbi, "Mismatch start address, segment0(%u) cp_blkaddr(%u)",
			  segment0_blkaddr, cp_blkaddr);
2015-12-15 09:58:18 +08:00
		return true;
	}
	if (cp_blkaddr + (segment_count_ckpt << log_blocks_per_seg) !=
							sit_blkaddr) {
2019-06-18 17:48:42 +08:00
		f2fs_info(sbi, "Wrong CP boundary, start(%u) end(%u) blocks(%u)",
			  cp_blkaddr, sit_blkaddr,
			  segment_count_ckpt << log_blocks_per_seg);
2015-12-15 09:58:18 +08:00
		return true;
	}
	if (sit_blkaddr + (segment_count_sit << log_blocks_per_seg) !=
							nat_blkaddr) {
2019-06-18 17:48:42 +08:00
		f2fs_info(sbi, "Wrong SIT boundary, start(%u) end(%u) blocks(%u)",
			  sit_blkaddr, nat_blkaddr,
			  segment_count_sit << log_blocks_per_seg);
2015-12-15 09:58:18 +08:00
		return true;
	}
	if (nat_blkaddr + (segment_count_nat << log_blocks_per_seg) !=
							ssa_blkaddr) {
2019-06-18 17:48:42 +08:00
		f2fs_info(sbi, "Wrong NAT boundary, start(%u) end(%u) blocks(%u)",
			  nat_blkaddr, ssa_blkaddr,
			  segment_count_nat << log_blocks_per_seg);
2015-12-15 09:58:18 +08:00
		return true;
	}
	if (ssa_blkaddr + (segment_count_ssa << log_blocks_per_seg) !=
							main_blkaddr) {
2019-06-18 17:48:42 +08:00
		f2fs_info(sbi, "Wrong SSA boundary, start(%u) end(%u) blocks(%u)",
			  ssa_blkaddr, main_blkaddr,
			  segment_count_ssa << log_blocks_per_seg);
2015-12-15 09:58:18 +08:00
		return true;
	}
2016-03-20 15:33:20 -07:00
	if (main_end_blkaddr > seg_end_blkaddr) {
2020-09-18 08:31:24 +08:00
		f2fs_info(sbi, "Wrong MAIN_AREA boundary, start(%u) end(%llu) block(%u)",
			  main_blkaddr, seg_end_blkaddr,
2019-06-18 17:48:42 +08:00
			  segment_count_main << log_blocks_per_seg);
2015-12-15 09:58:18 +08:00
		return true;
2016-03-20 15:33:20 -07:00
	} else if (main_end_blkaddr < seg_end_blkaddr) {
		int err = 0;
		char *res;
		/* fix in-memory information all the time */
		raw_super->segment_count = cpu_to_le32((main_end_blkaddr -
				segment0_blkaddr) >> log_blocks_per_seg);
		if (f2fs_readonly(sb) || bdev_read_only(sb->s_bdev)) {
2016-03-23 17:05:27 -07:00
set_sbi_flag ( sbi , SBI_NEED_SB_WRITE ) ;
2016-03-20 15:33:20 -07:00
res = " internally " ;
} else {
err = __f2fs_commit_super ( bh , NULL ) ;
res = err ? " failed " : " done " ;
}
2020-09-18 08:31:24 +08:00
f2fs_info ( sbi , " Fix alignment : %s, start(%u) end(%llu) block(%u) " ,
res , main_blkaddr , seg_end_blkaddr ,
2019-06-18 17:48:42 +08:00
segment_count_main < < log_blocks_per_seg ) ;
2016-03-20 15:33:20 -07:00
if ( err )
return true ;
2015-12-15 09:58:18 +08:00
}
return false ;
}
2016-03-23 17:05:27 -07:00
static int sanity_check_raw_super ( struct f2fs_sb_info * sbi ,
2016-03-20 15:33:20 -07:00
struct buffer_head * bh )
2012-11-02 17:07:47 +09:00
{
2020-09-17 19:11:58 +08:00
block_t segment_count , segs_per_sec , secs_per_zone , segment_count_main ;
2018-04-27 19:03:22 -07:00
block_t total_sections , blocks_per_seg ;
2016-03-20 15:33:20 -07:00
struct f2fs_super_block * raw_super = ( struct f2fs_super_block * )
( bh - > b_data + F2FS_SUPER_OFFSET ) ;
2018-09-28 20:25:56 +08:00
size_t crc_offset = 0 ;
__u32 crc = 0 ;
2019-07-25 11:08:52 +08:00
if ( le32_to_cpu ( raw_super - > magic ) ! = F2FS_SUPER_MAGIC ) {
f2fs_info ( sbi , " Magic Mismatch, valid(0x%x) - read(0x%x) " ,
F2FS_SUPER_MAGIC , le32_to_cpu ( raw_super - > magic ) ) ;
return - EINVAL ;
}
2018-09-28 20:25:56 +08:00
/* Check checksum_offset and crc in superblock */
2018-10-24 18:34:26 +08:00
if ( __F2FS_HAS_FEATURE ( raw_super , F2FS_FEATURE_SB_CHKSUM ) ) {
2018-09-28 20:25:56 +08:00
crc_offset = le32_to_cpu ( raw_super - > checksum_offset ) ;
if ( crc_offset ! =
offsetof ( struct f2fs_super_block , crc ) ) {
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Invalid SB checksum offset: %zu " ,
crc_offset ) ;
2019-07-25 11:08:52 +08:00
return - EFSCORRUPTED ;
2018-09-28 20:25:56 +08:00
}
crc = le32_to_cpu ( raw_super - > crc ) ;
if ( ! f2fs_crc_valid ( sbi , crc , raw_super , crc_offset ) ) {
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Invalid SB checksum value: %u " , crc ) ;
2019-07-25 11:08:52 +08:00
return - EFSCORRUPTED ;
2018-09-28 20:25:56 +08:00
}
}
2012-11-02 17:07:47 +09:00
/* Currently, support only 4KB block size */
2020-12-09 16:49:36 +08:00
if ( le32_to_cpu ( raw_super - > log_blocksize ) ! = F2FS_BLKSIZE_BITS ) {
f2fs_info ( sbi , " Invalid log_blocksize (%u), supports only %u " ,
le32_to_cpu ( raw_super - > log_blocksize ) ,
F2FS_BLKSIZE_BITS ) ;
2019-07-25 11:08:52 +08:00
return - EFSCORRUPTED ;
2012-12-30 14:52:05 +09:00
}
2013-02-01 19:07:57 +08:00
2015-12-15 09:58:18 +08:00
/* check log blocks per segment */
if ( le32_to_cpu ( raw_super - > log_blocks_per_seg ) ! = 9 ) {
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Invalid log blocks per segment (%u) " ,
le32_to_cpu ( raw_super - > log_blocks_per_seg ) ) ;
2019-07-25 11:08:52 +08:00
return - EFSCORRUPTED ;
2015-12-15 09:58:18 +08:00
}
2014-09-15 18:01:10 +08:00
/* Currently, support 512/1024/2048/4096 bytes sector size */
if ( le32_to_cpu ( raw_super - > log_sectorsize ) >
F2FS_MAX_LOG_SECTOR_SIZE | |
le32_to_cpu ( raw_super - > log_sectorsize ) <
F2FS_MIN_LOG_SECTOR_SIZE ) {
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Invalid log sectorsize (%u) " ,
le32_to_cpu ( raw_super - > log_sectorsize ) ) ;
2019-07-25 11:08:52 +08:00
return - EFSCORRUPTED ;
2012-12-30 14:52:05 +09:00
}
2014-09-15 18:01:10 +08:00
if ( le32_to_cpu ( raw_super - > log_sectors_per_block ) +
le32_to_cpu ( raw_super - > log_sectorsize ) ! =
F2FS_MAX_LOG_SECTOR_SIZE ) {
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Invalid log sectors per block(%u) log sectorsize(%u) " ,
le32_to_cpu ( raw_super - > log_sectors_per_block ) ,
le32_to_cpu ( raw_super - > log_sectorsize ) ) ;
2019-07-25 11:08:52 +08:00
return - EFSCORRUPTED ;
2012-12-30 14:52:05 +09:00
}
2015-12-15 09:58:18 +08:00
2018-04-27 19:03:22 -07:00
segment_count = le32_to_cpu ( raw_super - > segment_count ) ;
2020-09-17 19:11:58 +08:00
segment_count_main = le32_to_cpu ( raw_super - > segment_count_main ) ;
2018-04-27 19:03:22 -07:00
segs_per_sec = le32_to_cpu ( raw_super - > segs_per_sec ) ;
secs_per_zone = le32_to_cpu ( raw_super - > secs_per_zone ) ;
total_sections = le32_to_cpu ( raw_super - > section_count ) ;
/* blocks_per_seg should be 512, given the above check */
blocks_per_seg = 1 < < le32_to_cpu ( raw_super - > log_blocks_per_seg ) ;
if ( segment_count > F2FS_MAX_SEGMENT | |
segment_count < F2FS_MIN_SEGMENTS ) {
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Invalid segment count (%u) " , segment_count ) ;
2019-07-25 11:08:52 +08:00
return - EFSCORRUPTED ;
2018-04-27 19:03:22 -07:00
}
2020-09-17 19:11:58 +08:00
if ( total_sections > segment_count_main | | total_sections < 1 | |
2018-04-27 19:03:22 -07:00
segs_per_sec > segment_count | | ! segs_per_sec ) {
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Invalid segment/section count (%u, %u x %u) " ,
segment_count , total_sections , segs_per_sec ) ;
2019-07-25 11:08:52 +08:00
return - EFSCORRUPTED ;
2018-04-27 19:03:22 -07:00
}
2020-09-29 09:23:34 +08:00
if ( segment_count_main ! = total_sections * segs_per_sec ) {
f2fs_info ( sbi , " Invalid segment/section count (%u != %u * %u) " ,
segment_count_main , total_sections , segs_per_sec ) ;
return - EFSCORRUPTED ;
}
2018-04-27 19:03:22 -07:00
if ( ( segment_count / segs_per_sec ) < total_sections ) {
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Small segment_count (%u < %u * %u) " ,
segment_count , segs_per_sec , total_sections ) ;
2019-07-25 11:08:52 +08:00
return - EFSCORRUPTED ;
2018-04-27 19:03:22 -07:00
}
2018-12-22 11:22:26 +01:00
if ( segment_count > ( le64_to_cpu ( raw_super - > block_count ) > > 9 ) ) {
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Wrong segment_count / block_count (%u > %llu) " ,
segment_count , le64_to_cpu ( raw_super - > block_count ) ) ;
2019-07-25 11:08:52 +08:00
return - EFSCORRUPTED ;
2018-04-27 19:03:22 -07:00
}
2019-09-23 12:22:35 +08:00
if ( RDEV ( 0 ) . path [ 0 ] ) {
block_t dev_seg_count = le32_to_cpu ( RDEV ( 0 ) . total_segments ) ;
int i = 1 ;
while ( i < MAX_DEVICES & & RDEV ( i ) . path [ 0 ] ) {
dev_seg_count + = le32_to_cpu ( RDEV ( i ) . total_segments ) ;
i + + ;
}
if ( segment_count ! = dev_seg_count ) {
f2fs_info ( sbi , " Segment count (%u) mismatch with total segments from devices (%u) " ,
segment_count , dev_seg_count ) ;
return - EFSCORRUPTED ;
}
2020-09-21 20:53:13 +08:00
} else {
if ( __F2FS_HAS_FEATURE ( raw_super , F2FS_FEATURE_BLKZONED ) & &
! bdev_is_zoned ( sbi - > sb - > s_bdev ) ) {
f2fs_info ( sbi , " Zoned block device path is missing " ) ;
return - EFSCORRUPTED ;
}
2019-09-23 12:22:35 +08:00
}
2018-06-23 00:12:36 +08:00
if ( secs_per_zone > total_sections | | ! secs_per_zone ) {
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Wrong secs_per_zone / total_sections (%u, %u) " ,
secs_per_zone , total_sections ) ;
2019-07-25 11:08:52 +08:00
return - EFSCORRUPTED ;
2018-04-27 19:03:22 -07:00
}
if ( le32_to_cpu ( raw_super - > extension_count ) > F2FS_MAX_EXTENSION | |
raw_super - > hot_ext_count > F2FS_MAX_EXTENSION | |
( le32_to_cpu ( raw_super - > extension_count ) +
raw_super - > hot_ext_count ) > F2FS_MAX_EXTENSION ) {
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Corrupted extension count (%u + %u > %u) " ,
le32_to_cpu ( raw_super - > extension_count ) ,
raw_super - > hot_ext_count ,
F2FS_MAX_EXTENSION ) ;
2019-07-25 11:08:52 +08:00
return - EFSCORRUPTED ;
2018-04-27 19:03:22 -07:00
}
2021-08-06 08:04:37 +08:00
if ( le32_to_cpu ( raw_super - > cp_payload ) > =
( blocks_per_seg - F2FS_CP_PACKS -
NR_CURSEG_PERSIST_TYPE ) ) {
f2fs_info ( sbi , " Insane cp_payload (%u >= %u) " ,
2019-06-18 17:48:42 +08:00
le32_to_cpu ( raw_super - > cp_payload ) ,
2021-08-06 08:04:37 +08:00
blocks_per_seg - F2FS_CP_PACKS -
NR_CURSEG_PERSIST_TYPE ) ;
2019-07-25 11:08:52 +08:00
return - EFSCORRUPTED ;
2018-04-27 19:03:22 -07:00
}
2015-12-15 09:58:18 +08:00
/* check reserved ino info */
if ( le32_to_cpu ( raw_super - > node_ino ) ! = 1 | |
le32_to_cpu ( raw_super - > meta_ino ) ! = 2 | |
le32_to_cpu ( raw_super - > root_ino ) ! = 3 ) {
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Invalid Fs Meta Ino: node(%u) meta(%u) root(%u) " ,
le32_to_cpu ( raw_super - > node_ino ) ,
le32_to_cpu ( raw_super - > meta_ino ) ,
le32_to_cpu ( raw_super - > root_ino ) ) ;
2019-07-25 11:08:52 +08:00
return - EFSCORRUPTED ;
2015-12-15 09:58:18 +08:00
}
/* check CP/SIT/NAT/SSA/MAIN_AREA area boundary */
2016-03-23 17:05:27 -07:00
if ( sanity_check_area_boundary ( sbi , bh ) )
2019-07-25 11:08:52 +08:00
return - EFSCORRUPTED ;
2015-12-15 09:58:18 +08:00
2012-11-02 17:07:47 +09:00
return 0 ;
}
f2fs: clean up symbol namespace
As Ted reported:
"Hi, I was looking at f2fs's sources recently, and I noticed that there
is a very large number of non-static symbols which don't have a f2fs
prefix. There's well over a hundred (see attached below).
As one example, in fs/f2fs/dir.c there is:
unsigned char get_de_type(struct f2fs_dir_entry *de)
This function is clearly only useful for f2fs, but it has a generic
name. This means that if any other file system tries to have the same
symbol name, there will be a symbol conflict and the kernel would not
successfully build. It also means that when someone is looking f2fs
sources, it's not at all obvious whether a function such as
read_data_page(), invalidate_blocks(), is a generic kernel function
found in the fs, mm, or block layers, or a f2fs specific function.
You might want to fix this at some point. Hopefully Kent's bcachefs
isn't similarly using genericly named functions, since that might
cause conflicts with f2fs's functions --- but just as this would be a
problem that we would rightly insist that Kent fix, this is something
that we should have rightly insisted that f2fs should have fixed
before it was integrated into the mainline kernel.
acquire_orphan_inode
add_ino_entry
add_orphan_inode
allocate_data_block
allocate_new_segments
alloc_nid
alloc_nid_done
alloc_nid_failed
available_free_memory
...."
This patch adds an "f2fs_" prefix to all non-static symbols in order to:
a) avoid conflicts with other generic kernel symbols;
b) indicate that a function is f2fs-specific rather than a generic
one.
Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 00:20:41 +08:00
int f2fs_sanity_check_ckpt ( struct f2fs_sb_info * sbi )
2012-11-02 17:07:47 +09:00
{
unsigned int total , fsmeta ;
f2fs: prevent checkpoint once any IO failure is detected
This patch enhances the checkpoint routine to cope with IO errors.
Basically f2fs detects IO errors from end_io_write, and the errors can
occur during any of the data, node, and meta page writes.
In the previous code, when an IO error occurs during writes, f2fs sets a
flag, CP_ERROR_FLAG, in the raw checkpoint buffer which will be written to disk.
Afterwards, write_checkpoint() will check the flag and remount f2fs in
read-only (ro) mode.
However, even once f2fs is remounted in ro mode, dirty checkpoint pages can
still be written to disk by the flusher or kswapd in the background.
In such a case, after a cold reboot, f2fs would restore the checkpoint data having
CP_ERROR_FLAG, resulting in disabling write_checkpoint and remounting f2fs in
ro mode again.
Therefore, let's prevent any checkpoint page (meta) writes once an IO error has
occurred, and remount f2fs in ro mode right away at that moment.
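The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual f2fs code: `struct fs_state`, `note_write_error()` and `may_write_meta_page()` are hypothetical names invented for this sketch, and the real implementation works on the superblock info and checkpoint flags instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified filesystem state (not the real f2fs_sb_info). */
struct fs_state {
	bool cp_error;   /* set once any write IO error has been seen */
	bool read_only;  /* filesystem has been switched to read-only */
};

/* Called from the (hypothetical) write completion path on an IO error. */
static void note_write_error(struct fs_state *fs)
{
	fs->cp_error = true;   /* remember the error in memory */
	fs->read_only = true;  /* go read-only right away */
}

/* Meta (checkpoint) pages may only be written while no error was seen. */
static bool may_write_meta_page(const struct fs_state *fs)
{
	return !fs->cp_error;
}
```

The point of the sketch is the ordering: the error is recorded and the read-only switch happens at the moment of failure, so a later background flush finds `may_write_meta_page()` false and cannot push a stale checkpoint page to disk.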
Reported-by: Oliver Winker <oliver@oli1170.net>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
2013-01-24 19:56:11 +09:00
struct f2fs_super_block * raw_super = F2FS_RAW_SUPER ( sbi ) ;
struct f2fs_checkpoint * ckpt = F2FS_CKPT ( sbi ) ;
2016-12-05 13:56:04 -08:00
unsigned int ovp_segments , reserved_segments ;
2017-05-15 10:45:08 -07:00
unsigned int main_segs , blocks_per_seg ;
f2fs: fix to do sanity check with {sit,nat}_ver_bitmap_bytesize
This patch adds a sanity check on {sit,nat}_ver_bitmap_bytesize
during mount, in order to avoid accessing memory beyond the cache boundary
with an abnormal bitmap size.
- Overview
buffer overrun in build_sit_info() when mounting a crafted f2fs image
- Reproduce
- Kernel message
[ 548.580867] F2FS-fs (loop0): Invalid log blocks per segment (8201)
[ 548.580877] F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
[ 548.584979] ==================================================================
[ 548.586568] BUG: KASAN: use-after-free in kmemdup+0x36/0x50
[ 548.587715] Read of size 64 at addr ffff8801e9c265ff by task mount/1295
[ 548.589428] CPU: 1 PID: 1295 Comm: mount Not tainted 4.18.0-rc1+ #4
[ 548.589432] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 548.589438] Call Trace:
[ 548.589474] dump_stack+0x7b/0xb5
[ 548.589487] print_address_description+0x70/0x290
[ 548.589492] kasan_report+0x291/0x390
[ 548.589496] ? kmemdup+0x36/0x50
[ 548.589509] check_memory_region+0x139/0x190
[ 548.589514] memcpy+0x23/0x50
[ 548.589518] kmemdup+0x36/0x50
[ 548.589545] f2fs_build_segment_manager+0x8fa/0x3410
[ 548.589551] ? __asan_loadN+0xf/0x20
[ 548.589560] ? f2fs_sanity_check_ckpt+0x1be/0x240
[ 548.589566] ? f2fs_flush_sit_entries+0x10c0/0x10c0
[ 548.589587] ? __put_user_ns+0x40/0x40
[ 548.589604] ? find_next_bit+0x57/0x90
[ 548.589610] f2fs_fill_super+0x194b/0x2b40
[ 548.589617] ? f2fs_commit_super+0x1b0/0x1b0
[ 548.589637] ? set_blocksize+0x90/0x140
[ 548.589651] mount_bdev+0x1c5/0x210
[ 548.589655] ? f2fs_commit_super+0x1b0/0x1b0
[ 548.589667] f2fs_mount+0x15/0x20
[ 548.589672] mount_fs+0x60/0x1a0
[ 548.589683] ? alloc_vfsmnt+0x309/0x360
[ 548.589688] vfs_kern_mount+0x6b/0x1a0
[ 548.589699] do_mount+0x34a/0x18c0
[ 548.589710] ? lockref_put_or_lock+0xcf/0x160
[ 548.589716] ? copy_mount_string+0x20/0x20
[ 548.589728] ? memcg_kmem_put_cache+0x1b/0xa0
[ 548.589734] ? kasan_check_write+0x14/0x20
[ 548.589740] ? _copy_from_user+0x6a/0x90
[ 548.589744] ? memdup_user+0x42/0x60
[ 548.589750] ksys_mount+0x83/0xd0
[ 548.589755] __x64_sys_mount+0x67/0x80
[ 548.589781] do_syscall_64+0x78/0x170
[ 548.589797] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 548.589820] RIP: 0033:0x7f76fc331b9a
[ 548.589821] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 548.589880] RSP: 002b:00007ffd4f0a0e48 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 548.589890] RAX: ffffffffffffffda RBX: 000000000146c030 RCX: 00007f76fc331b9a
[ 548.589892] RDX: 000000000146c210 RSI: 000000000146df30 RDI: 0000000001474ec0
[ 548.589895] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 548.589897] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000000001474ec0
[ 548.589900] R13: 000000000146c210 R14: 0000000000000000 R15: 0000000000000003
[ 548.590242] The buggy address belongs to the page:
[ 548.591243] page:ffffea0007a70980 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[ 548.592886] flags: 0x2ffff0000000000()
[ 548.593665] raw: 02ffff0000000000 dead000000000100 dead000000000200 0000000000000000
[ 548.595258] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 548.603713] page dumped because: kasan: bad access detected
[ 548.605203] Memory state around the buggy address:
[ 548.606198] ffff8801e9c26480: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.607676] ffff8801e9c26500: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.609157] >ffff8801e9c26580: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.610629] ^
[ 548.612088] ffff8801e9c26600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.613674] ffff8801e9c26680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.615141] ==================================================================
[ 548.616613] Disabling lock debugging due to kernel taint
[ 548.622871] WARNING: CPU: 1 PID: 1295 at mm/page_alloc.c:4065 __alloc_pages_slowpath+0xe4a/0x1420
[ 548.622878] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[ 548.623217] CPU: 1 PID: 1295 Comm: mount Tainted: G B 4.18.0-rc1+ #4
[ 548.623219] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 548.623226] RIP: 0010:__alloc_pages_slowpath+0xe4a/0x1420
[ 548.623227] Code: ff ff 01 89 85 c8 fe ff ff e9 91 fc ff ff 41 89 c5 e9 5c fc ff ff 0f 0b 89 f8 25 ff ff f7 ff 89 85 8c fe ff ff e9 d5 f2 ff ff <0f> 0b e9 65 f2 ff ff 65 8b 05 38 81 d2 47 f6 c4 01 74 1c 65 48 8b
[ 548.623281] RSP: 0018:ffff8801f28c7678 EFLAGS: 00010246
[ 548.623284] RAX: 0000000000000000 RBX: 00000000006040c0 RCX: ffffffffb82f73b7
[ 548.623287] RDX: 1ffff1003e518eeb RSI: 000000000000000c RDI: 0000000000000000
[ 548.623290] RBP: ffff8801f28c7880 R08: 0000000000000000 R09: ffffed0047fff2c5
[ 548.623292] R10: 0000000000000001 R11: ffffed0047fff2c4 R12: ffff8801e88de040
[ 548.623295] R13: 00000000006040c0 R14: 000000000000000c R15: ffff8801f28c7938
[ 548.623299] FS: 00007f76fca51840(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[ 548.623302] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 548.623304] CR2: 00007f19b9171760 CR3: 00000001ed952000 CR4: 00000000000006e0
[ 548.623317] Call Trace:
[ 548.623325] ? kasan_check_read+0x11/0x20
[ 548.623330] ? __zone_watermark_ok+0x92/0x240
[ 548.623336] ? get_page_from_freelist+0x1c3/0x1d90
[ 548.623347] ? _raw_spin_lock_irqsave+0x2a/0x60
[ 548.623353] ? warn_alloc+0x250/0x250
[ 548.623358] ? save_stack+0x46/0xd0
[ 548.623361] ? kasan_kmalloc+0xad/0xe0
[ 548.623366] ? __isolate_free_page+0x2a0/0x2a0
[ 548.623370] ? mount_fs+0x60/0x1a0
[ 548.623374] ? vfs_kern_mount+0x6b/0x1a0
[ 548.623378] ? do_mount+0x34a/0x18c0
[ 548.623383] ? ksys_mount+0x83/0xd0
[ 548.623387] ? __x64_sys_mount+0x67/0x80
[ 548.623391] ? do_syscall_64+0x78/0x170
[ 548.623396] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 548.623401] __alloc_pages_nodemask+0x3c5/0x400
[ 548.623407] ? __alloc_pages_slowpath+0x1420/0x1420
[ 548.623412] ? __mutex_lock_slowpath+0x20/0x20
[ 548.623417] ? kvmalloc_node+0x31/0x80
[ 548.623424] alloc_pages_current+0x75/0x110
[ 548.623436] kmalloc_order+0x24/0x60
[ 548.623442] kmalloc_order_trace+0x24/0xb0
[ 548.623448] __kmalloc_track_caller+0x207/0x220
[ 548.623455] ? f2fs_build_node_manager+0x399/0xbb0
[ 548.623460] kmemdup+0x20/0x50
[ 548.623465] f2fs_build_node_manager+0x399/0xbb0
[ 548.623470] f2fs_fill_super+0x195e/0x2b40
[ 548.623477] ? f2fs_commit_super+0x1b0/0x1b0
[ 548.623481] ? set_blocksize+0x90/0x140
[ 548.623486] mount_bdev+0x1c5/0x210
[ 548.623489] ? f2fs_commit_super+0x1b0/0x1b0
[ 548.623495] f2fs_mount+0x15/0x20
[ 548.623498] mount_fs+0x60/0x1a0
[ 548.623503] ? alloc_vfsmnt+0x309/0x360
[ 548.623508] vfs_kern_mount+0x6b/0x1a0
[ 548.623513] do_mount+0x34a/0x18c0
[ 548.623518] ? lockref_put_or_lock+0xcf/0x160
[ 548.623523] ? copy_mount_string+0x20/0x20
[ 548.623528] ? memcg_kmem_put_cache+0x1b/0xa0
[ 548.623533] ? kasan_check_write+0x14/0x20
[ 548.623537] ? _copy_from_user+0x6a/0x90
[ 548.623542] ? memdup_user+0x42/0x60
[ 548.623547] ksys_mount+0x83/0xd0
[ 548.623552] __x64_sys_mount+0x67/0x80
[ 548.623557] do_syscall_64+0x78/0x170
[ 548.623562] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 548.623566] RIP: 0033:0x7f76fc331b9a
[ 548.623567] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 548.623632] RSP: 002b:00007ffd4f0a0e48 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 548.623636] RAX: ffffffffffffffda RBX: 000000000146c030 RCX: 00007f76fc331b9a
[ 548.623639] RDX: 000000000146c210 RSI: 000000000146df30 RDI: 0000000001474ec0
[ 548.623641] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 548.623643] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000000001474ec0
[ 548.623646] R13: 000000000146c210 R14: 0000000000000000 R15: 0000000000000003
[ 548.623650] ---[ end trace 4ce02f25ff7d3df5 ]---
[ 548.623656] F2FS-fs (loop0): Failed to initialize F2FS node manager
[ 548.627936] F2FS-fs (loop0): Invalid log blocks per segment (8201)
[ 548.627940] F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
[ 548.635835] F2FS-fs (loop0): Failed to initialize F2FS node manager
- Location
https://elixir.bootlin.com/linux/v4.18-rc1/source/fs/f2fs/segment.c#L3578
sit_i->sit_bitmap = kmemdup(src_bitmap, bitmap_size, GFP_KERNEL);
Buffer overrun happens when doing memcpy. I suspect there is missing (inconsistent) checks on bitmap_size.
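A check of the kind this report calls for could look roughly like the sketch below. It is only an illustration under a stated assumption: that each of the SIT and NAT areas keeps two copies of its table and that the version bitmap holds one bit per block of a single copy. The helper names `expected_bitmap_bytes()` and `bitmap_sizes_sane()` are hypothetical and do not come from the f2fs sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed rule (for illustration only): an area stores two copies of its
 * table, and the version bitmap holds one bit per block of a single copy.
 */
static uint64_t expected_bitmap_bytes(uint32_t area_segs,
				      uint32_t log_blocks_per_seg)
{
	return (((uint64_t)area_segs / 2) << log_blocks_per_seg) / 8;
}

/* Returns true when both on-disk bitmap sizes match the expected values. */
static bool bitmap_sizes_sane(uint32_t sit_segs, uint32_t nat_segs,
			      uint32_t log_blocks_per_seg,
			      uint32_t sit_bitmap_size,
			      uint32_t nat_bitmap_size)
{
	/* Reject shift counts that would overflow the 64-bit intermediate. */
	if (log_blocks_per_seg > 31)
		return false;

	return sit_bitmap_size ==
		       expected_bitmap_bytes(sit_segs, log_blocks_per_seg) &&
	       nat_bitmap_size ==
		       expected_bitmap_bytes(nat_segs, log_blocks_per_seg);
}
```

With 4 SIT segments, 8 NAT segments and 512 blocks per segment, this rule expects 128 and 256 bytes respectively; any other size stored in the checkpoint would be rejected before kmemdup() copies the bitmap.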
Reported by Wen Xu (wen.xu@gatech.edu) from SSLab, Gatech.
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-06-23 11:25:19 +08:00
unsigned int sit_segs , nat_segs ;
unsigned int sit_bitmap_size , nat_bitmap_size ;
unsigned int log_blocks_per_seg ;
2018-06-27 18:05:54 +08:00
unsigned int segment_count_main ;
f2fs: fix to do sanity check with cp_pack_start_sum
After fuzzing, cp_pack_start_sum could be corrupted, so the current log's
summary info would be wrong due to loading an incorrect summary block.
Then, if a segment's type in the current log exceeds NR_CURSEG_TYPE, it
can finally lead to accessing an invalid dirty_i->dirty_segmap bitmap.
Add a sanity check for cp_pack_start_sum to fix this issue.
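As a rough sketch of such a range check, assuming the summary blocks must start after the checkpoint pack's first block plus its payload and must leave room for one summary block per current-segment type before the end of the segment. The function name `cp_pack_start_sum_sane()` and the constant `SKETCH_NR_CURSEG_TYPE` (with its value of 6) are assumptions made for this illustration only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed number of current-segment types; illustrative value only. */
#define SKETCH_NR_CURSEG_TYPE 6u

/* Returns true when cp_pack_start_sum lies inside the assumed valid range. */
static bool cp_pack_start_sum_sane(uint32_t cp_pack_start_sum,
				   uint32_t cp_payload,
				   uint32_t blocks_per_seg)
{
	/* Guard against unsigned underflow in the upper bound below. */
	if (blocks_per_seg < 1 + SKETCH_NR_CURSEG_TYPE)
		return false;

	/* Lower bound: must come after the first block and the payload. */
	if (cp_pack_start_sum <= cp_payload)
		return false;

	/* Upper bound: leave room for the per-type summary blocks. */
	return cp_pack_start_sum <=
	       blocks_per_seg - 1 - SKETCH_NR_CURSEG_TYPE;
}
```

Under these assumptions, with 512 blocks per segment and no payload, the accepted range is 1 to 505 inclusive; a fuzzed value outside that range would fail the mount instead of loading a wrong summary block.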
https://bugzilla.kernel.org/show_bug.cgi?id=200419
- Reproduce
- Kernel message (f2fs-dev w/ KASAN)
[ 3117.578432] F2FS-fs (loop0): Invalid log blocks per segment (8)
[ 3117.578445] F2FS-fs (loop0): Can't find valid F2FS filesystem in 2th superblock
[ 3117.581364] F2FS-fs (loop0): invalid crc_offset: 30716
[ 3117.583564] WARNING: CPU: 1 PID: 1225 at fs/f2fs/checkpoint.c:90 __get_meta_page+0x448/0x4b0
[ 3117.583570] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer joydev input_leds serio_raw snd soundcore mac_hid i2c_piix4 ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 8139too qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel psmouse aes_x86_64 8139cp crypto_simd cryptd mii glue_helper pata_acpi floppy
[ 3117.584014] CPU: 1 PID: 1225 Comm: mount Not tainted 4.17.0+ #1
[ 3117.584017] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.584022] RIP: 0010:__get_meta_page+0x448/0x4b0
[ 3117.584023] Code: 00 49 8d bc 24 84 00 00 00 e8 74 54 da ff 41 83 8c 24 84 00 00 00 08 4c 89 f6 4c 89 ef e8 c0 d9 95 00 48 89 ef e8 18 e3 00 00 <0f> 0b f0 80 4d 48 04 e9 0f fe ff ff 0f 0b 48 89 c7 48 89 04 24 e8
[ 3117.584072] RSP: 0018:ffff88018eb678c0 EFLAGS: 00010286
[ 3117.584082] RAX: ffff88018f0a6a78 RBX: ffffea0007a46600 RCX: ffffffff9314d1b2
[ 3117.584085] RDX: ffffffff00000001 RSI: 0000000000000000 RDI: ffff88018f0a6a98
[ 3117.584087] RBP: ffff88018ebe9980 R08: 0000000000000002 R09: 0000000000000001
[ 3117.584090] R10: 0000000000000001 R11: ffffed00326e4450 R12: ffff880193722200
[ 3117.584092] R13: ffff88018ebe9afc R14: 0000000000000206 R15: ffff88018eb67900
[ 3117.584096] FS: 00007f5694636840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 3117.584098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3117.584101] CR2: 00000000016f21b8 CR3: 0000000191c22000 CR4: 00000000000006e0
[ 3117.584112] Call Trace:
[ 3117.584121] ? f2fs_set_meta_page_dirty+0x150/0x150
[ 3117.584127] ? f2fs_build_segment_manager+0xbf9/0x3190
[ 3117.584133] ? f2fs_npages_for_summary_flush+0x75/0x120
[ 3117.584145] f2fs_build_segment_manager+0xda8/0x3190
[ 3117.584151] ? f2fs_get_valid_checkpoint+0x298/0xa00
[ 3117.584156] ? f2fs_flush_sit_entries+0x10e0/0x10e0
[ 3117.584184] ? map_id_range_down+0x17c/0x1b0
[ 3117.584188] ? __put_user_ns+0x30/0x30
[ 3117.584206] ? find_next_bit+0x53/0x90
[ 3117.584237] ? cpumask_next+0x16/0x20
[ 3117.584249] f2fs_fill_super+0x1948/0x2b40
[ 3117.584258] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.584279] ? sget_userns+0x65e/0x690
[ 3117.584296] ? set_blocksize+0x88/0x130
[ 3117.584302] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.584305] mount_bdev+0x1c0/0x200
[ 3117.584310] mount_fs+0x5c/0x190
[ 3117.584320] vfs_kern_mount+0x64/0x190
[ 3117.584330] do_mount+0x2e4/0x1450
[ 3117.584343] ? lockref_put_return+0x130/0x130
[ 3117.584347] ? copy_mount_string+0x20/0x20
[ 3117.584357] ? kasan_unpoison_shadow+0x31/0x40
[ 3117.584362] ? kasan_kmalloc+0xa6/0xd0
[ 3117.584373] ? memcg_kmem_put_cache+0x16/0x90
[ 3117.584377] ? __kmalloc_track_caller+0x196/0x210
[ 3117.584383] ? _copy_from_user+0x61/0x90
[ 3117.584396] ? memdup_user+0x3e/0x60
[ 3117.584401] ksys_mount+0x7e/0xd0
[ 3117.584405] __x64_sys_mount+0x62/0x70
[ 3117.584427] do_syscall_64+0x73/0x160
[ 3117.584440] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.584455] RIP: 0033:0x7f5693f14b9a
[ 3117.584456] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.584505] RSP: 002b:00007fff27346488 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 3117.584510] RAX: ffffffffffffffda RBX: 00000000016e2030 RCX: 00007f5693f14b9a
[ 3117.584512] RDX: 00000000016e2210 RSI: 00000000016e3f30 RDI: 00000000016ee040
[ 3117.584514] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 3117.584516] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 00000000016ee040
[ 3117.584519] R13: 00000000016e2210 R14: 0000000000000000 R15: 0000000000000003
[ 3117.584523] ---[ end trace a8e0d899985faf31 ]---
[ 3117.685663] F2FS-fs (loop0): f2fs_check_nid_range: out-of-range nid=2, run fsck to fix.
[ 3117.685673] F2FS-fs (loop0): recover_data: ino = 2 (i_size: recover) recovered = 1, err = 0
[ 3117.685707] ==================================================================
[ 3117.685955] BUG: KASAN: slab-out-of-bounds in __remove_dirty_segment+0xdd/0x1e0
[ 3117.686175] Read of size 8 at addr ffff88018f0a63d0 by task mount/1225
[ 3117.686477] CPU: 0 PID: 1225 Comm: mount Tainted: G W 4.17.0+ #1
[ 3117.686481] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.686483] Call Trace:
[ 3117.686494] dump_stack+0x71/0xab
[ 3117.686512] print_address_description+0x6b/0x290
[ 3117.686517] kasan_report+0x28e/0x390
[ 3117.686522] ? __remove_dirty_segment+0xdd/0x1e0
[ 3117.686527] __remove_dirty_segment+0xdd/0x1e0
[ 3117.686532] locate_dirty_segment+0x189/0x190
[ 3117.686538] f2fs_allocate_new_segments+0xa9/0xe0
[ 3117.686543] recover_data+0x703/0x2c20
[ 3117.686547] ? f2fs_recover_fsync_data+0x48f/0xd50
[ 3117.686553] ? ksys_mount+0x7e/0xd0
[ 3117.686564] ? policy_nodemask+0x1a/0x90
[ 3117.686567] ? policy_node+0x56/0x70
[ 3117.686571] ? add_fsync_inode+0xf0/0xf0
[ 3117.686592] ? blk_finish_plug+0x44/0x60
[ 3117.686597] ? f2fs_ra_meta_pages+0x38b/0x5e0
[ 3117.686602] ? find_inode_fast+0xac/0xc0
[ 3117.686606] ? f2fs_is_valid_blkaddr+0x320/0x320
[ 3117.686618] ? __radix_tree_lookup+0x150/0x150
[ 3117.686633] ? dqget+0x670/0x670
[ 3117.686648] ? pagecache_get_page+0x29/0x410
[ 3117.686656] ? kmem_cache_alloc+0x176/0x1e0
[ 3117.686660] ? f2fs_is_valid_blkaddr+0x11d/0x320
[ 3117.686664] f2fs_recover_fsync_data+0xc23/0xd50
[ 3117.686670] ? f2fs_space_for_roll_forward+0x60/0x60
[ 3117.686674] ? rb_insert_color+0x323/0x3d0
[ 3117.686678] ? f2fs_recover_orphan_inodes+0xa5/0x700
[ 3117.686683] ? proc_register+0x153/0x1d0
[ 3117.686686] ? f2fs_remove_orphan_inode+0x10/0x10
[ 3117.686695] ? f2fs_attr_store+0x50/0x50
[ 3117.686700] ? proc_create_single_data+0x52/0x60
[ 3117.686707] f2fs_fill_super+0x1d06/0x2b40
[ 3117.686728] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.686735] ? sget_userns+0x65e/0x690
[ 3117.686740] ? set_blocksize+0x88/0x130
[ 3117.686745] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.686748] mount_bdev+0x1c0/0x200
[ 3117.686753] mount_fs+0x5c/0x190
[ 3117.686758] vfs_kern_mount+0x64/0x190
[ 3117.686762] do_mount+0x2e4/0x1450
[ 3117.686769] ? lockref_put_return+0x130/0x130
[ 3117.686773] ? copy_mount_string+0x20/0x20
[ 3117.686777] ? kasan_unpoison_shadow+0x31/0x40
[ 3117.686780] ? kasan_kmalloc+0xa6/0xd0
[ 3117.686786] ? memcg_kmem_put_cache+0x16/0x90
[ 3117.686790] ? __kmalloc_track_caller+0x196/0x210
[ 3117.686795] ? _copy_from_user+0x61/0x90
[ 3117.686801] ? memdup_user+0x3e/0x60
[ 3117.686804] ksys_mount+0x7e/0xd0
[ 3117.686809] __x64_sys_mount+0x62/0x70
[ 3117.686816] do_syscall_64+0x73/0x160
[ 3117.686824] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.686829] RIP: 0033:0x7f5693f14b9a
[ 3117.686830] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.686887] RSP: 002b:00007fff27346488 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 3117.686892] RAX: ffffffffffffffda RBX: 00000000016e2030 RCX: 00007f5693f14b9a
[ 3117.686894] RDX: 00000000016e2210 RSI: 00000000016e3f30 RDI: 00000000016ee040
[ 3117.686896] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 3117.686899] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 00000000016ee040
[ 3117.686901] R13: 00000000016e2210 R14: 0000000000000000 R15: 0000000000000003
[ 3117.687005] Allocated by task 1225:
[ 3117.687152] kasan_kmalloc+0xa6/0xd0
[ 3117.687157] kmem_cache_alloc_trace+0xfd/0x200
[ 3117.687161] f2fs_build_segment_manager+0x2d09/0x3190
[ 3117.687165] f2fs_fill_super+0x1948/0x2b40
[ 3117.687168] mount_bdev+0x1c0/0x200
[ 3117.687171] mount_fs+0x5c/0x190
[ 3117.687174] vfs_kern_mount+0x64/0x190
[ 3117.687177] do_mount+0x2e4/0x1450
[ 3117.687180] ksys_mount+0x7e/0xd0
[ 3117.687182] __x64_sys_mount+0x62/0x70
[ 3117.687186] do_syscall_64+0x73/0x160
[ 3117.687190] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.687285] Freed by task 19:
[ 3117.687412] __kasan_slab_free+0x137/0x190
[ 3117.687416] kfree+0x8b/0x1b0
[ 3117.687460] ttm_bo_man_put_node+0x61/0x80 [ttm]
[ 3117.687476] ttm_bo_cleanup_refs+0x15f/0x250 [ttm]
[ 3117.687492] ttm_bo_delayed_delete+0x2f0/0x300 [ttm]
[ 3117.687507] ttm_bo_delayed_workqueue+0x17/0x50 [ttm]
[ 3117.687528] process_one_work+0x2f9/0x740
[ 3117.687531] worker_thread+0x78/0x6b0
[ 3117.687541] kthread+0x177/0x1c0
[ 3117.687545] ret_from_fork+0x35/0x40
[ 3117.687638] The buggy address belongs to the object at ffff88018f0a6300
which belongs to the cache kmalloc-192 of size 192
[ 3117.688014] The buggy address is located 16 bytes to the right of
192-byte region [ffff88018f0a6300, ffff88018f0a63c0)
[ 3117.688382] The buggy address belongs to the page:
[ 3117.688554] page:ffffea00063c2980 count:1 mapcount:0 mapping:ffff8801f3403180 index:0x0
[ 3117.688788] flags: 0x17fff8000000100(slab)
[ 3117.688944] raw: 017fff8000000100 ffffea00063c2840 0000000e0000000e ffff8801f3403180
[ 3117.689166] raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
[ 3117.689386] page dumped because: kasan: bad access detected
[ 3117.689653] Memory state around the buggy address:
[ 3117.689816] ffff88018f0a6280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 3117.690027] ffff88018f0a6300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 3117.690239] >ffff88018f0a6380: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3117.690448] ^
[ 3117.690644] ffff88018f0a6400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 3117.690868] ffff88018f0a6480: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3117.691077] ==================================================================
[ 3117.691290] Disabling lock debugging due to kernel taint
[ 3117.693893] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 3117.694120] PGD 80000001f01bc067 P4D 80000001f01bc067 PUD 1d9638067 PMD 0
[ 3117.694338] Oops: 0002 [#1] SMP KASAN PTI
[ 3117.694490] CPU: 1 PID: 1225 Comm: mount Tainted: G B W 4.17.0+ #1
[ 3117.694703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.695073] RIP: 0010:__remove_dirty_segment+0xe2/0x1e0
[ 3117.695246] Code: c4 48 89 c7 e8 cf bb d7 ff 45 0f b6 24 24 41 83 e4 3f 44 88 64 24 07 41 83 e4 3f 4a 8d 7c e3 08 e8 b3 bc d7 ff 4a 8b 4c e3 08 <f0> 4c 0f b3 29 0f 82 94 00 00 00 48 8d bd 20 04 00 00 e8 97 bb d7
[ 3117.695793] RSP: 0018:ffff88018eb67638 EFLAGS: 00010292
[ 3117.695969] RAX: 0000000000000000 RBX: ffff88018f0a6300 RCX: 0000000000000000
[ 3117.696182] RDX: 0000000000000000 RSI: 0000000000000297 RDI: 0000000000000297
[ 3117.696391] RBP: ffff88018ebe9980 R08: ffffed003e743ebb R09: ffffed003e743ebb
[ 3117.696604] R10: 0000000000000001 R11: ffffed003e743eba R12: 0000000000000019
[ 3117.696813] R13: 0000000000000014 R14: 0000000000000320 R15: ffff88018ebe99e0
[ 3117.697032] FS: 00007f5694636840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 3117.697280] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3117.702357] CR2: 00007fe89bb1a000 CR3: 0000000191c22000 CR4: 00000000000006e0
[ 3117.707235] Call Trace:
[ 3117.712077] locate_dirty_segment+0x189/0x190
[ 3117.716891] f2fs_allocate_new_segments+0xa9/0xe0
[ 3117.721617] recover_data+0x703/0x2c20
[ 3117.726316] ? f2fs_recover_fsync_data+0x48f/0xd50
[ 3117.730957] ? ksys_mount+0x7e/0xd0
[ 3117.735573] ? policy_nodemask+0x1a/0x90
[ 3117.740198] ? policy_node+0x56/0x70
[ 3117.744829] ? add_fsync_inode+0xf0/0xf0
[ 3117.749487] ? blk_finish_plug+0x44/0x60
[ 3117.754152] ? f2fs_ra_meta_pages+0x38b/0x5e0
[ 3117.758831] ? find_inode_fast+0xac/0xc0
[ 3117.763448] ? f2fs_is_valid_blkaddr+0x320/0x320
[ 3117.768046] ? __radix_tree_lookup+0x150/0x150
[ 3117.772603] ? dqget+0x670/0x670
[ 3117.777159] ? pagecache_get_page+0x29/0x410
[ 3117.781648] ? kmem_cache_alloc+0x176/0x1e0
[ 3117.786067] ? f2fs_is_valid_blkaddr+0x11d/0x320
[ 3117.790476] f2fs_recover_fsync_data+0xc23/0xd50
[ 3117.794790] ? f2fs_space_for_roll_forward+0x60/0x60
[ 3117.799086] ? rb_insert_color+0x323/0x3d0
[ 3117.803304] ? f2fs_recover_orphan_inodes+0xa5/0x700
[ 3117.807563] ? proc_register+0x153/0x1d0
[ 3117.811766] ? f2fs_remove_orphan_inode+0x10/0x10
[ 3117.815947] ? f2fs_attr_store+0x50/0x50
[ 3117.820087] ? proc_create_single_data+0x52/0x60
[ 3117.824262] f2fs_fill_super+0x1d06/0x2b40
[ 3117.828367] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.832432] ? sget_userns+0x65e/0x690
[ 3117.836500] ? set_blocksize+0x88/0x130
[ 3117.840501] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.844420] mount_bdev+0x1c0/0x200
[ 3117.848275] mount_fs+0x5c/0x190
[ 3117.852053] vfs_kern_mount+0x64/0x190
[ 3117.855810] do_mount+0x2e4/0x1450
[ 3117.859441] ? lockref_put_return+0x130/0x130
[ 3117.862996] ? copy_mount_string+0x20/0x20
[ 3117.866417] ? kasan_unpoison_shadow+0x31/0x40
[ 3117.869719] ? kasan_kmalloc+0xa6/0xd0
[ 3117.872948] ? memcg_kmem_put_cache+0x16/0x90
[ 3117.876121] ? __kmalloc_track_caller+0x196/0x210
[ 3117.879333] ? _copy_from_user+0x61/0x90
[ 3117.882467] ? memdup_user+0x3e/0x60
[ 3117.885604] ksys_mount+0x7e/0xd0
[ 3117.888700] __x64_sys_mount+0x62/0x70
[ 3117.891742] do_syscall_64+0x73/0x160
[ 3117.894692] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.897669] RIP: 0033:0x7f5693f14b9a
[ 3117.900563] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.906922] RSP: 002b:00007fff27346488 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 3117.910159] RAX: ffffffffffffffda RBX: 00000000016e2030 RCX: 00007f5693f14b9a
[ 3117.913469] RDX: 00000000016e2210 RSI: 00000000016e3f30 RDI: 00000000016ee040
[ 3117.916764] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 3117.920071] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 00000000016ee040
[ 3117.923393] R13: 00000000016e2210 R14: 0000000000000000 R15: 0000000000000003
[ 3117.926680] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer joydev input_leds serio_raw snd soundcore mac_hid i2c_piix4 ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 8139too qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel psmouse aes_x86_64 8139cp crypto_simd cryptd mii glue_helper pata_acpi floppy
[ 3117.949979] CR2: 0000000000000000
[ 3117.954283] ---[ end trace a8e0d899985faf32 ]---
[ 3117.958575] RIP: 0010:__remove_dirty_segment+0xe2/0x1e0
[ 3117.962810] Code: c4 48 89 c7 e8 cf bb d7 ff 45 0f b6 24 24 41 83 e4 3f 44 88 64 24 07 41 83 e4 3f 4a 8d 7c e3 08 e8 b3 bc d7 ff 4a 8b 4c e3 08 <f0> 4c 0f b3 29 0f 82 94 00 00 00 48 8d bd 20 04 00 00 e8 97 bb d7
[ 3117.971789] RSP: 0018:ffff88018eb67638 EFLAGS: 00010292
[ 3117.976333] RAX: 0000000000000000 RBX: ffff88018f0a6300 RCX: 0000000000000000
[ 3117.980926] RDX: 0000000000000000 RSI: 0000000000000297 RDI: 0000000000000297
[ 3117.985497] RBP: ffff88018ebe9980 R08: ffffed003e743ebb R09: ffffed003e743ebb
[ 3117.990098] R10: 0000000000000001 R11: ffffed003e743eba R12: 0000000000000019
[ 3117.994761] R13: 0000000000000014 R14: 0000000000000320 R15: ffff88018ebe99e0
[ 3117.999392] FS: 00007f5694636840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 3118.004096] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3118.008816] CR2: 00007fe89bb1a000 CR3: 0000000191c22000 CR4: 00000000000006e0
- Location
https://elixir.bootlin.com/linux/v4.18-rc3/source/fs/f2fs/segment.c#L775
if (test_and_clear_bit(segno, dirty_i->dirty_segmap[t]))
dirty_i->nr_dirty[t]--;
Here dirty_i->dirty_segmap[t] can be NULL which leads to crash in test_and_clear_bit()
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 19:16:11 +08:00
	unsigned int cp_pack_start_sum, cp_payload;
2019-04-15 15:30:50 +08:00
	block_t user_block_count, valid_user_blocks;
	block_t avail_node_count, valid_node_count;
2021-08-06 08:04:37 +08:00
	unsigned int nat_blocks, nat_bits_bytes, nat_bits_blocks;
f2fs: fix to do sanity check with current segment number
https://bugzilla.kernel.org/show_bug.cgi?id=200219
Reproduction way:
- mount image
- run poc code
- umount image
F2FS-fs (loop1): Bitmap was wrongly set, blk:15364
------------[ cut here ]------------
kernel BUG at /home/yuchao/git/devf2fs/segment.c:2061!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 2 PID: 17686 Comm: umount Tainted: G W O 4.18.0-rc2+ #39
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
EIP: update_sit_entry+0x459/0x4e0 [f2fs]
Code: e8 1c b5 fd ff 0f 0b 0f 0b 8b 45 e4 c7 44 24 08 9c 7a 6c f8 c7 44 24 04 bc 4a 6c f8 89 44 24 0c 8b 06 89 04 24 e8 f7 b4 fd ff <0f> 0b 8b 45 e4 0f b6 d2 89 54 24 10 c7 44 24 08 60 7a 6c f8 c7 44
EAX: 00000032 EBX: 000000f8 ECX: 00000002 EDX: 00000001
ESI: d7177000 EDI: f520fe68 EBP: d6477c6c ESP: d6477c34
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010282
CR0: 80050033 CR2: b7fbe000 CR3: 2a99b3c0 CR4: 000406f0
Call Trace:
f2fs_allocate_data_block+0x124/0x580 [f2fs]
do_write_page+0x78/0x150 [f2fs]
f2fs_do_write_node_page+0x25/0xa0 [f2fs]
__write_node_page+0x2bf/0x550 [f2fs]
f2fs_sync_node_pages+0x60e/0x6d0 [f2fs]
? sync_inode_metadata+0x2f/0x40
? f2fs_write_checkpoint+0x28f/0x7d0 [f2fs]
? up_write+0x1e/0x80
f2fs_write_checkpoint+0x2a9/0x7d0 [f2fs]
? mark_held_locks+0x5d/0x80
? _raw_spin_unlock_irq+0x27/0x50
kill_f2fs_super+0x68/0x90 [f2fs]
deactivate_locked_super+0x3d/0x70
deactivate_super+0x40/0x60
cleanup_mnt+0x39/0x70
__cleanup_mnt+0x10/0x20
task_work_run+0x81/0xa0
exit_to_usermode_loop+0x59/0xa7
do_fast_syscall_32+0x1f5/0x22c
entry_SYSENTER_32+0x53/0x86
EIP: 0xb7f95c51
Code: c1 1e f7 ff ff 89 e5 8b 55 08 85 d2 8b 81 64 cd ff ff 74 02 89 02 5d c3 8b 0c 24 c3 8b 1c 24 c3 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
EAX: 00000000 EBX: 0871ab90 ECX: bfb2cd00 EDX: 00000000
ESI: 00000000 EDI: 0871ab90 EBP: 0871ab90 ESP: bfb2cd7c
DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
Modules linked in: f2fs(O) crc32_generic bnep rfcomm bluetooth ecdh_generic snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq pcbc joydev aesni_intel snd_seq_device aes_i586 snd_timer crypto_simd snd cryptd soundcore mac_hid serio_raw video i2c_piix4 parport_pc ppdev lp parport hid_generic psmouse usbhid hid e1000 [last unloaded: f2fs]
---[ end trace d423f83982cfcdc5 ]---
The reason is that different log headers use the same segment: once
one log's next block address has been used by another log, the kernel
panics as shown above.
Main area: 24 segs, 24 secs 24 zones
- COLD data: 0, 0, 0
- WARM data: 1, 1, 1
- HOT data: 20, 20, 20
- Dir dnode: 22, 22, 22
- File dnode: 22, 22, 22
- Indir nodes: 21, 21, 21
So this patch adds a sanity check to detect this condition and avoid
the issue.
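A minimal userspace sketch of the kind of check described above, assuming the six current-segment numbers are available as a plain array. The helper name `find_duplicate_curseg` and the array layout are hypothetical and are not the kernel's code. With the values from the dump above (0, 1, 20, 22, 22, 21), the Dir dnode and File dnode logs both point at segment 22.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel function): scan an array of
 * current-segment numbers and report the first pair of logs that
 * point at the same segment.
 * Returns 1 and stores the two log indices in *a and *b on a clash,
 * or 0 if every log has its own segment.
 */
static int find_duplicate_curseg(const unsigned int *segno, size_t n,
				 size_t *a, size_t *b)
{
	size_t i, j;

	for (i = 0; i < n; i++) {
		for (j = i + 1; j < n; j++) {
			if (segno[i] == segno[j]) {
				*a = i;
				*b = j;
				return 1;
			}
		}
	}
	return 0;
}
```

A mount-time caller could refuse the image whenever this returns 1, instead of hitting the BUG later during checkpoint.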
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-06 20:34:12 +08:00
	int i, j;
2012-11-02 17:07:47 +09:00
	total = le32_to_cpu(raw_super->segment_count);
	fsmeta = le32_to_cpu(raw_super->segment_count_ckpt);
f2fs: fix to do sanity check with {sit,nat}_ver_bitmap_bytesize
This patch adds a sanity check on {sit,nat}_ver_bitmap_bytesize
during mount, in order to avoid accessing memory beyond the cache
boundary when the bitmap size is abnormal.
- Overview
buffer overrun in build_sit_info() when mounting a crafted f2fs image
- Reproduce
- Kernel message
[ 548.580867] F2FS-fs (loop0): Invalid log blocks per segment (8201)
[ 548.580877] F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
[ 548.584979] ==================================================================
[ 548.586568] BUG: KASAN: use-after-free in kmemdup+0x36/0x50
[ 548.587715] Read of size 64 at addr ffff8801e9c265ff by task mount/1295
[ 548.589428] CPU: 1 PID: 1295 Comm: mount Not tainted 4.18.0-rc1+ #4
[ 548.589432] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 548.589438] Call Trace:
[ 548.589474] dump_stack+0x7b/0xb5
[ 548.589487] print_address_description+0x70/0x290
[ 548.589492] kasan_report+0x291/0x390
[ 548.589496] ? kmemdup+0x36/0x50
[ 548.589509] check_memory_region+0x139/0x190
[ 548.589514] memcpy+0x23/0x50
[ 548.589518] kmemdup+0x36/0x50
[ 548.589545] f2fs_build_segment_manager+0x8fa/0x3410
[ 548.589551] ? __asan_loadN+0xf/0x20
[ 548.589560] ? f2fs_sanity_check_ckpt+0x1be/0x240
[ 548.589566] ? f2fs_flush_sit_entries+0x10c0/0x10c0
[ 548.589587] ? __put_user_ns+0x40/0x40
[ 548.589604] ? find_next_bit+0x57/0x90
[ 548.589610] f2fs_fill_super+0x194b/0x2b40
[ 548.589617] ? f2fs_commit_super+0x1b0/0x1b0
[ 548.589637] ? set_blocksize+0x90/0x140
[ 548.589651] mount_bdev+0x1c5/0x210
[ 548.589655] ? f2fs_commit_super+0x1b0/0x1b0
[ 548.589667] f2fs_mount+0x15/0x20
[ 548.589672] mount_fs+0x60/0x1a0
[ 548.589683] ? alloc_vfsmnt+0x309/0x360
[ 548.589688] vfs_kern_mount+0x6b/0x1a0
[ 548.589699] do_mount+0x34a/0x18c0
[ 548.589710] ? lockref_put_or_lock+0xcf/0x160
[ 548.589716] ? copy_mount_string+0x20/0x20
[ 548.589728] ? memcg_kmem_put_cache+0x1b/0xa0
[ 548.589734] ? kasan_check_write+0x14/0x20
[ 548.589740] ? _copy_from_user+0x6a/0x90
[ 548.589744] ? memdup_user+0x42/0x60
[ 548.589750] ksys_mount+0x83/0xd0
[ 548.589755] __x64_sys_mount+0x67/0x80
[ 548.589781] do_syscall_64+0x78/0x170
[ 548.589797] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 548.589820] RIP: 0033:0x7f76fc331b9a
[ 548.589821] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 548.589880] RSP: 002b:00007ffd4f0a0e48 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 548.589890] RAX: ffffffffffffffda RBX: 000000000146c030 RCX: 00007f76fc331b9a
[ 548.589892] RDX: 000000000146c210 RSI: 000000000146df30 RDI: 0000000001474ec0
[ 548.589895] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 548.589897] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000000001474ec0
[ 548.589900] R13: 000000000146c210 R14: 0000000000000000 R15: 0000000000000003
[ 548.590242] The buggy address belongs to the page:
[ 548.591243] page:ffffea0007a70980 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[ 548.592886] flags: 0x2ffff0000000000()
[ 548.593665] raw: 02ffff0000000000 dead000000000100 dead000000000200 0000000000000000
[ 548.595258] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 548.603713] page dumped because: kasan: bad access detected
[ 548.605203] Memory state around the buggy address:
[ 548.606198] ffff8801e9c26480: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.607676] ffff8801e9c26500: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.609157] >ffff8801e9c26580: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.610629] ^
[ 548.612088] ffff8801e9c26600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.613674] ffff8801e9c26680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.615141] ==================================================================
[ 548.616613] Disabling lock debugging due to kernel taint
[ 548.622871] WARNING: CPU: 1 PID: 1295 at mm/page_alloc.c:4065 __alloc_pages_slowpath+0xe4a/0x1420
[ 548.622878] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[ 548.623217] CPU: 1 PID: 1295 Comm: mount Tainted: G B 4.18.0-rc1+ #4
[ 548.623219] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 548.623226] RIP: 0010:__alloc_pages_slowpath+0xe4a/0x1420
[ 548.623227] Code: ff ff 01 89 85 c8 fe ff ff e9 91 fc ff ff 41 89 c5 e9 5c fc ff ff 0f 0b 89 f8 25 ff ff f7 ff 89 85 8c fe ff ff e9 d5 f2 ff ff <0f> 0b e9 65 f2 ff ff 65 8b 05 38 81 d2 47 f6 c4 01 74 1c 65 48 8b
[ 548.623281] RSP: 0018:ffff8801f28c7678 EFLAGS: 00010246
[ 548.623284] RAX: 0000000000000000 RBX: 00000000006040c0 RCX: ffffffffb82f73b7
[ 548.623287] RDX: 1ffff1003e518eeb RSI: 000000000000000c RDI: 0000000000000000
[ 548.623290] RBP: ffff8801f28c7880 R08: 0000000000000000 R09: ffffed0047fff2c5
[ 548.623292] R10: 0000000000000001 R11: ffffed0047fff2c4 R12: ffff8801e88de040
[ 548.623295] R13: 00000000006040c0 R14: 000000000000000c R15: ffff8801f28c7938
[ 548.623299] FS: 00007f76fca51840(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[ 548.623302] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 548.623304] CR2: 00007f19b9171760 CR3: 00000001ed952000 CR4: 00000000000006e0
[ 548.623317] Call Trace:
[ 548.623325] ? kasan_check_read+0x11/0x20
[ 548.623330] ? __zone_watermark_ok+0x92/0x240
[ 548.623336] ? get_page_from_freelist+0x1c3/0x1d90
[ 548.623347] ? _raw_spin_lock_irqsave+0x2a/0x60
[ 548.623353] ? warn_alloc+0x250/0x250
[ 548.623358] ? save_stack+0x46/0xd0
[ 548.623361] ? kasan_kmalloc+0xad/0xe0
[ 548.623366] ? __isolate_free_page+0x2a0/0x2a0
[ 548.623370] ? mount_fs+0x60/0x1a0
[ 548.623374] ? vfs_kern_mount+0x6b/0x1a0
[ 548.623378] ? do_mount+0x34a/0x18c0
[ 548.623383] ? ksys_mount+0x83/0xd0
[ 548.623387] ? __x64_sys_mount+0x67/0x80
[ 548.623391] ? do_syscall_64+0x78/0x170
[ 548.623396] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 548.623401] __alloc_pages_nodemask+0x3c5/0x400
[ 548.623407] ? __alloc_pages_slowpath+0x1420/0x1420
[ 548.623412] ? __mutex_lock_slowpath+0x20/0x20
[ 548.623417] ? kvmalloc_node+0x31/0x80
[ 548.623424] alloc_pages_current+0x75/0x110
[ 548.623436] kmalloc_order+0x24/0x60
[ 548.623442] kmalloc_order_trace+0x24/0xb0
[ 548.623448] __kmalloc_track_caller+0x207/0x220
[ 548.623455] ? f2fs_build_node_manager+0x399/0xbb0
[ 548.623460] kmemdup+0x20/0x50
[ 548.623465] f2fs_build_node_manager+0x399/0xbb0
[ 548.623470] f2fs_fill_super+0x195e/0x2b40
[ 548.623477] ? f2fs_commit_super+0x1b0/0x1b0
[ 548.623481] ? set_blocksize+0x90/0x140
[ 548.623486] mount_bdev+0x1c5/0x210
[ 548.623489] ? f2fs_commit_super+0x1b0/0x1b0
[ 548.623495] f2fs_mount+0x15/0x20
[ 548.623498] mount_fs+0x60/0x1a0
[ 548.623503] ? alloc_vfsmnt+0x309/0x360
[ 548.623508] vfs_kern_mount+0x6b/0x1a0
[ 548.623513] do_mount+0x34a/0x18c0
[ 548.623518] ? lockref_put_or_lock+0xcf/0x160
[ 548.623523] ? copy_mount_string+0x20/0x20
[ 548.623528] ? memcg_kmem_put_cache+0x1b/0xa0
[ 548.623533] ? kasan_check_write+0x14/0x20
[ 548.623537] ? _copy_from_user+0x6a/0x90
[ 548.623542] ? memdup_user+0x42/0x60
[ 548.623547] ksys_mount+0x83/0xd0
[ 548.623552] __x64_sys_mount+0x67/0x80
[ 548.623557] do_syscall_64+0x78/0x170
[ 548.623562] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 548.623566] RIP: 0033:0x7f76fc331b9a
[ 548.623567] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 548.623632] RSP: 002b:00007ffd4f0a0e48 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 548.623636] RAX: ffffffffffffffda RBX: 000000000146c030 RCX: 00007f76fc331b9a
[ 548.623639] RDX: 000000000146c210 RSI: 000000000146df30 RDI: 0000000001474ec0
[ 548.623641] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 548.623643] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000000001474ec0
[ 548.623646] R13: 000000000146c210 R14: 0000000000000000 R15: 0000000000000003
[ 548.623650] ---[ end trace 4ce02f25ff7d3df5 ]---
[ 548.623656] F2FS-fs (loop0): Failed to initialize F2FS node manager
[ 548.627936] F2FS-fs (loop0): Invalid log blocks per segment (8201)
[ 548.627940] F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
[ 548.635835] F2FS-fs (loop0): Failed to initialize F2FS node manager
- Location
https://elixir.bootlin.com/linux/v4.18-rc1/source/fs/f2fs/segment.c#L3578
sit_i->sit_bitmap = kmemdup(src_bitmap, bitmap_size, GFP_KERNEL);
A buffer overrun happens during the memcpy. I suspect there are missing (or inconsistent) checks on bitmap_size.
Reported by Wen Xu (wen.xu@gatech.edu) from SSLab, Gatech.
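A rough sketch of a consistency check of this kind. It assumes (this message does not state it) that each version bitmap holds one bit per block across half of the corresponding area's segments. The function and parameter names are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: expected byte size of a version bitmap,
 * ASSUMING one bit per block across half of the area's segments.
 */
static uint32_t expected_bitmap_bytes(uint32_t area_segs,
				      uint32_t log_blocks_per_seg)
{
	return ((area_segs / 2) << log_blocks_per_seg) / 8;
}

/* Returns 1 if the on-disk size is inconsistent and mount should fail. */
static int bitmap_size_is_bad(uint32_t on_disk_bytes, uint32_t area_segs,
			      uint32_t log_blocks_per_seg)
{
	return on_disk_bytes != expected_bitmap_bytes(area_segs,
						      log_blocks_per_seg);
}
```

Rejecting a mismatched size before the kmemdup() call keeps the copy length tied to values that were already validated from the superblock.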
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-06-23 11:25:19 +08:00
	sit_segs = le32_to_cpu(raw_super->segment_count_sit);
	fsmeta += sit_segs;
	nat_segs = le32_to_cpu(raw_super->segment_count_nat);
	fsmeta += nat_segs;
2012-11-02 17:07:47 +09:00
	fsmeta += le32_to_cpu(ckpt->rsvd_segment_count);
	fsmeta += le32_to_cpu(raw_super->segment_count_ssa);
2013-12-06 15:00:58 +09:00
	if (unlikely(fsmeta >= total))
2012-11-02 17:07:47 +09:00
		return 1;
f2fs: prevent checkpoint once any IO failure is detected
This patch enhances the checkpoint routine to cope with IO errors.
f2fs detects IO errors in end_io_write, and such errors can occur during
data, node, or meta page writes.
In the previous code, when an IO error occurred during a write, f2fs set a
flag, CP_ERROR_FLAG, in the raw checkpoint buffer that is written to disk.
Afterwards, write_checkpoint() checked the flag and remounted f2fs in
read-only (ro) mode.
However, even after f2fs was remounted read-only, dirty checkpoint pages could
still be written to disk by the flusher or kswapd in the background.
In that case, after a cold reboot, f2fs would restore checkpoint data carrying
CP_ERROR_FLAG, which disabled write_checkpoint and remounted f2fs read-only
again.
Therefore, prevent any checkpoint (meta) page writes once an IO error has
occurred, and remount f2fs read-only right away at that moment.
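The behaviour described above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the patch itself: `struct fs_state`, `note_io_error()` and `write_meta_page()` are hypothetical names standing in for the real f2fs state and write paths.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for per-filesystem state. */
struct fs_state {
	bool cp_error;   /* set once any write I/O error is seen */
	bool read_only;  /* filesystem has been switched to read-only */
	int meta_writes; /* number of meta pages actually written */
};

/* Called from the (hypothetical) write-completion path on failure. */
static void note_io_error(struct fs_state *fs)
{
	fs->cp_error = true;
	fs->read_only = true; /* go read-only right away */
}

/* Refuse checkpoint (meta) page writes after an I/O error. */
static int write_meta_page(struct fs_state *fs)
{
	if (fs->cp_error)
		return -EIO;
	fs->meta_writes++;
	return 0;
}
```

The point of the model is that the error flag is consulted on every meta write, so a background flusher cannot push a checkpoint page to disk after the failure.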
Reported-by: Oliver Winker <oliver@oli1170.net>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
2013-01-24 19:56:11 +09:00
2016-12-05 13:56:04 -08:00
	ovp_segments = le32_to_cpu(ckpt->overprov_segment_count);
	reserved_segments = le32_to_cpu(ckpt->rsvd_segment_count);
2021-05-21 01:32:53 -07:00
	if (!f2fs_sb_has_readonly(sbi) &&
			unlikely(fsmeta < F2FS_MIN_META_SEGMENTS ||
2016-12-05 13:56:04 -08:00
			ovp_segments == 0 || reserved_segments == 0)) {
2019-06-18 17:48:42 +08:00
		f2fs_err(sbi, "Wrong layout: check mkfs.f2fs version");
2016-12-05 13:56:04 -08:00
		return 1;
	}
2018-06-27 18:05:54 +08:00
	user_block_count = le64_to_cpu(ckpt->user_block_count);
2021-05-21 01:32:53 -07:00
	segment_count_main = le32_to_cpu(raw_super->segment_count_main) +
			(f2fs_sb_has_readonly(sbi) ? 1 : 0);
2018-06-27 18:05:54 +08:00
	log_blocks_per_seg = le32_to_cpu(raw_super->log_blocks_per_seg);
	if (!user_block_count || user_block_count >=
			segment_count_main << log_blocks_per_seg) {
2019-06-18 17:48:42 +08:00
		f2fs_err(sbi, "Wrong user_block_count: %u",
			 user_block_count);
2018-06-27 18:05:54 +08:00
		return 1;
	}
2019-04-15 15:30:50 +08:00
	valid_user_blocks = le64_to_cpu(ckpt->valid_block_count);
	if (valid_user_blocks > user_block_count) {
2019-06-18 17:48:42 +08:00
		f2fs_err(sbi, "Wrong valid_user_blocks: %u, user_block_count: %u",
			 valid_user_blocks, user_block_count);
2019-04-15 15:30:50 +08:00
		return 1;
	}
	valid_node_count = le32_to_cpu(ckpt->valid_node_count);
2019-08-05 18:27:25 +08:00
	avail_node_count = sbi->total_node_count - F2FS_RESERVED_NODE_NUM;
2019-04-15 15:30:50 +08:00
	if (valid_node_count > avail_node_count) {
2019-06-18 17:48:42 +08:00
		f2fs_err(sbi, "Wrong valid_node_count: %u, avail_node_count: %u",
			 valid_node_count, avail_node_count);
2019-04-15 15:30:50 +08:00
		return 1;
	}
2017-05-15 10:45:08 -07:00
	main_segs = le32_to_cpu(raw_super->segment_count_main);
	blocks_per_seg = sbi->blocks_per_seg;
	for (i = 0; i < NR_CURSEG_NODE_TYPE; i++) {
		if (le32_to_cpu(ckpt->cur_node_segno[i]) >= main_segs ||
			le16_to_cpu(ckpt->cur_node_blkoff[i]) >= blocks_per_seg)
			return 1;
2021-05-21 01:32:53 -07:00
		if (f2fs_sb_has_readonly(sbi))
			goto check_data;
2018-09-06 20:34:12 +08:00
		for (j = i + 1; j < NR_CURSEG_NODE_TYPE; j++) {
			if (le32_to_cpu(ckpt->cur_node_segno[i]) ==
				le32_to_cpu(ckpt->cur_node_segno[j])) {
2019-06-18 17:48:42 +08:00
				f2fs_err(sbi, "Node segment (%u, %u) has the same segno: %u",
					 i, j,
					 le32_to_cpu(ckpt->cur_node_segno[i]));
2018-09-06 20:34:12 +08:00
				return 1;
			}
		}
2017-05-15 10:45:08 -07:00
	}
2021-05-21 01:32:53 -07:00
check_data:
2017-05-15 10:45:08 -07:00
	for (i = 0; i < NR_CURSEG_DATA_TYPE; i++) {
		if (le32_to_cpu(ckpt->cur_data_segno[i]) >= main_segs ||
			le16_to_cpu(ckpt->cur_data_blkoff[i]) >= blocks_per_seg)
			return 1;
2021-05-21 01:32:53 -07:00
		if (f2fs_sb_has_readonly(sbi))
			goto skip_cross;
f2fs: fix to do sanity check with current segment number
https://bugzilla.kernel.org/show_bug.cgi?id=200219
Reproduction way:
- mount image
- run poc code
- umount image
F2FS-fs (loop1): Bitmap was wrongly set, blk:15364
------------[ cut here ]------------
kernel BUG at /home/yuchao/git/devf2fs/segment.c:2061!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 2 PID: 17686 Comm: umount Tainted: G W O 4.18.0-rc2+ #39
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
EIP: update_sit_entry+0x459/0x4e0 [f2fs]
Code: e8 1c b5 fd ff 0f 0b 0f 0b 8b 45 e4 c7 44 24 08 9c 7a 6c f8 c7 44 24 04 bc 4a 6c f8 89 44 24 0c 8b 06 89 04 24 e8 f7 b4 fd ff <0f> 0b 8b 45 e4 0f b6 d2 89 54 24 10 c7 44 24 08 60 7a 6c f8 c7 44
EAX: 00000032 EBX: 000000f8 ECX: 00000002 EDX: 00000001
ESI: d7177000 EDI: f520fe68 EBP: d6477c6c ESP: d6477c34
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010282
CR0: 80050033 CR2: b7fbe000 CR3: 2a99b3c0 CR4: 000406f0
Call Trace:
f2fs_allocate_data_block+0x124/0x580 [f2fs]
do_write_page+0x78/0x150 [f2fs]
f2fs_do_write_node_page+0x25/0xa0 [f2fs]
__write_node_page+0x2bf/0x550 [f2fs]
f2fs_sync_node_pages+0x60e/0x6d0 [f2fs]
? sync_inode_metadata+0x2f/0x40
? f2fs_write_checkpoint+0x28f/0x7d0 [f2fs]
? up_write+0x1e/0x80
f2fs_write_checkpoint+0x2a9/0x7d0 [f2fs]
? mark_held_locks+0x5d/0x80
? _raw_spin_unlock_irq+0x27/0x50
kill_f2fs_super+0x68/0x90 [f2fs]
deactivate_locked_super+0x3d/0x70
deactivate_super+0x40/0x60
cleanup_mnt+0x39/0x70
__cleanup_mnt+0x10/0x20
task_work_run+0x81/0xa0
exit_to_usermode_loop+0x59/0xa7
do_fast_syscall_32+0x1f5/0x22c
entry_SYSENTER_32+0x53/0x86
EIP: 0xb7f95c51
Code: c1 1e f7 ff ff 89 e5 8b 55 08 85 d2 8b 81 64 cd ff ff 74 02 89 02 5d c3 8b 0c 24 c3 8b 1c 24 c3 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
EAX: 00000000 EBX: 0871ab90 ECX: bfb2cd00 EDX: 00000000
ESI: 00000000 EDI: 0871ab90 EBP: 0871ab90 ESP: bfb2cd7c
DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
Modules linked in: f2fs(O) crc32_generic bnep rfcomm bluetooth ecdh_generic snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq pcbc joydev aesni_intel snd_seq_device aes_i586 snd_timer crypto_simd snd cryptd soundcore mac_hid serio_raw video i2c_piix4 parport_pc ppdev lp parport hid_generic psmouse usbhid hid e1000 [last unloaded: f2fs]
---[ end trace d423f83982cfcdc5 ]---
The cause is that different log headers use the same segment: once
one log's next block address is used by another log, the kernel
panics as above.
Main area: 24 segs, 24 secs 24 zones
- COLD data: 0, 0, 0
- WARM data: 1, 1, 1
- HOT data: 20, 20, 20
- Dir dnode: 22, 22, 22
- File dnode: 22, 22, 22
- Indir nodes: 21, 21, 21
So this patch adds a sanity check to detect this condition and avoid
the issue.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
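The check described in the message above can be sketched in isolation as follows. This is a minimal userspace illustration, not the kernel code: `struct cur_segs`, `NR_NODE_LOGS`, `NR_DATA_LOGS` and `cur_segs_collide()` are hypothetical names, and the segment numbers are assumed to be already converted to CPU byte order. It reports a collision whenever two active logs (node/node, data/data or node/data) point at the same segment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the checkpoint's current-segment arrays. */
#define NR_NODE_LOGS 3
#define NR_DATA_LOGS 3

struct cur_segs {
	uint32_t node_segno[NR_NODE_LOGS];
	uint32_t data_segno[NR_DATA_LOGS];
};

/* Return true if any two active logs share a segment number. */
static bool cur_segs_collide(const struct cur_segs *cs)
{
	size_t i, j;

	assert(cs != NULL);

	/* node log vs. every later node log */
	for (i = 0; i < NR_NODE_LOGS; i++)
		for (j = i + 1; j < NR_NODE_LOGS; j++)
			if (cs->node_segno[i] == cs->node_segno[j])
				return true;

	/* data log vs. every later data log */
	for (i = 0; i < NR_DATA_LOGS; i++)
		for (j = i + 1; j < NR_DATA_LOGS; j++)
			if (cs->data_segno[i] == cs->data_segno[j])
				return true;

	/* every node log vs. every data log */
	for (i = 0; i < NR_NODE_LOGS; i++)
		for (j = 0; j < NR_DATA_LOGS; j++)
			if (cs->node_segno[i] == cs->data_segno[j])
				return true;

	return false;
}
```

With the values from the report (Dir dnode and File dnode both at segment 22), the sketch flags a collision, which is the condition the patch rejects at mount time.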
2018-09-06 20:34:12 +08:00
for (j = i + 1; j < NR_CURSEG_DATA_TYPE; j++) {
if (le32_to_cpu(ckpt->cur_data_segno[i]) ==
le32_to_cpu(ckpt->cur_data_segno[j])) {
2019-06-18 17:48:42 +08:00
f2fs_err(sbi, "Data segment (%u, %u) has the same segno: %u",
i, j,
le32_to_cpu(ckpt->cur_data_segno[i]));
2018-09-06 20:34:12 +08:00
return 1;
}
}
}
for (i = 0; i < NR_CURSEG_NODE_TYPE; i++) {
2019-08-23 15:40:45 -07:00
for (j = 0; j < NR_CURSEG_DATA_TYPE; j++) {
2018-09-06 20:34:12 +08:00
if (le32_to_cpu(ckpt->cur_node_segno[i]) ==
le32_to_cpu(ckpt->cur_data_segno[j])) {
2019-08-23 15:40:45 -07:00
f2fs_err(sbi, "Node segment (%u) and Data segment (%u) has the same segno: %u",
2019-06-18 17:48:42 +08:00
i, j,
le32_to_cpu(ckpt->cur_node_segno[i]));
2018-09-06 20:34:12 +08:00
return 1;
}
}
2017-05-15 10:45:08 -07:00
}
2021-05-21 01:32:53 -07:00
skip_cross:
f2fs: fix to do sanity check with {sit,nat}_ver_bitmap_bytesize
This patch adds a sanity check on {sit,nat}_ver_bitmap_bytesize
during mount, to avoid accessing memory beyond the cache boundary when
the bitmap size is abnormal.
- Overview
buffer overrun in build_sit_info() when mounting a crafted f2fs image
- Reproduce
- Kernel message
[ 548.580867] F2FS-fs (loop0): Invalid log blocks per segment (8201)
[ 548.580877] F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
[ 548.584979] ==================================================================
[ 548.586568] BUG: KASAN: use-after-free in kmemdup+0x36/0x50
[ 548.587715] Read of size 64 at addr ffff8801e9c265ff by task mount/1295
[ 548.589428] CPU: 1 PID: 1295 Comm: mount Not tainted 4.18.0-rc1+ #4
[ 548.589432] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 548.589438] Call Trace:
[ 548.589474] dump_stack+0x7b/0xb5
[ 548.589487] print_address_description+0x70/0x290
[ 548.589492] kasan_report+0x291/0x390
[ 548.589496] ? kmemdup+0x36/0x50
[ 548.589509] check_memory_region+0x139/0x190
[ 548.589514] memcpy+0x23/0x50
[ 548.589518] kmemdup+0x36/0x50
[ 548.589545] f2fs_build_segment_manager+0x8fa/0x3410
[ 548.589551] ? __asan_loadN+0xf/0x20
[ 548.589560] ? f2fs_sanity_check_ckpt+0x1be/0x240
[ 548.589566] ? f2fs_flush_sit_entries+0x10c0/0x10c0
[ 548.589587] ? __put_user_ns+0x40/0x40
[ 548.589604] ? find_next_bit+0x57/0x90
[ 548.589610] f2fs_fill_super+0x194b/0x2b40
[ 548.589617] ? f2fs_commit_super+0x1b0/0x1b0
[ 548.589637] ? set_blocksize+0x90/0x140
[ 548.589651] mount_bdev+0x1c5/0x210
[ 548.589655] ? f2fs_commit_super+0x1b0/0x1b0
[ 548.589667] f2fs_mount+0x15/0x20
[ 548.589672] mount_fs+0x60/0x1a0
[ 548.589683] ? alloc_vfsmnt+0x309/0x360
[ 548.589688] vfs_kern_mount+0x6b/0x1a0
[ 548.589699] do_mount+0x34a/0x18c0
[ 548.589710] ? lockref_put_or_lock+0xcf/0x160
[ 548.589716] ? copy_mount_string+0x20/0x20
[ 548.589728] ? memcg_kmem_put_cache+0x1b/0xa0
[ 548.589734] ? kasan_check_write+0x14/0x20
[ 548.589740] ? _copy_from_user+0x6a/0x90
[ 548.589744] ? memdup_user+0x42/0x60
[ 548.589750] ksys_mount+0x83/0xd0
[ 548.589755] __x64_sys_mount+0x67/0x80
[ 548.589781] do_syscall_64+0x78/0x170
[ 548.589797] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 548.589820] RIP: 0033:0x7f76fc331b9a
[ 548.589821] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 548.589880] RSP: 002b:00007ffd4f0a0e48 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 548.589890] RAX: ffffffffffffffda RBX: 000000000146c030 RCX: 00007f76fc331b9a
[ 548.589892] RDX: 000000000146c210 RSI: 000000000146df30 RDI: 0000000001474ec0
[ 548.589895] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 548.589897] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000000001474ec0
[ 548.589900] R13: 000000000146c210 R14: 0000000000000000 R15: 0000000000000003
[ 548.590242] The buggy address belongs to the page:
[ 548.591243] page:ffffea0007a70980 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[ 548.592886] flags: 0x2ffff0000000000()
[ 548.593665] raw: 02ffff0000000000 dead000000000100 dead000000000200 0000000000000000
[ 548.595258] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 548.603713] page dumped because: kasan: bad access detected
[ 548.605203] Memory state around the buggy address:
[ 548.606198] ffff8801e9c26480: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.607676] ffff8801e9c26500: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.609157] >ffff8801e9c26580: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.610629] ^
[ 548.612088] ffff8801e9c26600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.613674] ffff8801e9c26680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.615141] ==================================================================
[ 548.616613] Disabling lock debugging due to kernel taint
[ 548.622871] WARNING: CPU: 1 PID: 1295 at mm/page_alloc.c:4065 __alloc_pages_slowpath+0xe4a/0x1420
[ 548.622878] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[ 548.623217] CPU: 1 PID: 1295 Comm: mount Tainted: G B 4.18.0-rc1+ #4
[ 548.623219] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 548.623226] RIP: 0010:__alloc_pages_slowpath+0xe4a/0x1420
[ 548.623227] Code: ff ff 01 89 85 c8 fe ff ff e9 91 fc ff ff 41 89 c5 e9 5c fc ff ff 0f 0b 89 f8 25 ff ff f7 ff 89 85 8c fe ff ff e9 d5 f2 ff ff <0f> 0b e9 65 f2 ff ff 65 8b 05 38 81 d2 47 f6 c4 01 74 1c 65 48 8b
[ 548.623281] RSP: 0018:ffff8801f28c7678 EFLAGS: 00010246
[ 548.623284] RAX: 0000000000000000 RBX: 00000000006040c0 RCX: ffffffffb82f73b7
[ 548.623287] RDX: 1ffff1003e518eeb RSI: 000000000000000c RDI: 0000000000000000
[ 548.623290] RBP: ffff8801f28c7880 R08: 0000000000000000 R09: ffffed0047fff2c5
[ 548.623292] R10: 0000000000000001 R11: ffffed0047fff2c4 R12: ffff8801e88de040
[ 548.623295] R13: 00000000006040c0 R14: 000000000000000c R15: ffff8801f28c7938
[ 548.623299] FS: 00007f76fca51840(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[ 548.623302] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 548.623304] CR2: 00007f19b9171760 CR3: 00000001ed952000 CR4: 00000000000006e0
[ 548.623317] Call Trace:
[ 548.623325] ? kasan_check_read+0x11/0x20
[ 548.623330] ? __zone_watermark_ok+0x92/0x240
[ 548.623336] ? get_page_from_freelist+0x1c3/0x1d90
[ 548.623347] ? _raw_spin_lock_irqsave+0x2a/0x60
[ 548.623353] ? warn_alloc+0x250/0x250
[ 548.623358] ? save_stack+0x46/0xd0
[ 548.623361] ? kasan_kmalloc+0xad/0xe0
[ 548.623366] ? __isolate_free_page+0x2a0/0x2a0
[ 548.623370] ? mount_fs+0x60/0x1a0
[ 548.623374] ? vfs_kern_mount+0x6b/0x1a0
[ 548.623378] ? do_mount+0x34a/0x18c0
[ 548.623383] ? ksys_mount+0x83/0xd0
[ 548.623387] ? __x64_sys_mount+0x67/0x80
[ 548.623391] ? do_syscall_64+0x78/0x170
[ 548.623396] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 548.623401] __alloc_pages_nodemask+0x3c5/0x400
[ 548.623407] ? __alloc_pages_slowpath+0x1420/0x1420
[ 548.623412] ? __mutex_lock_slowpath+0x20/0x20
[ 548.623417] ? kvmalloc_node+0x31/0x80
[ 548.623424] alloc_pages_current+0x75/0x110
[ 548.623436] kmalloc_order+0x24/0x60
[ 548.623442] kmalloc_order_trace+0x24/0xb0
[ 548.623448] __kmalloc_track_caller+0x207/0x220
[ 548.623455] ? f2fs_build_node_manager+0x399/0xbb0
[ 548.623460] kmemdup+0x20/0x50
[ 548.623465] f2fs_build_node_manager+0x399/0xbb0
[ 548.623470] f2fs_fill_super+0x195e/0x2b40
[ 548.623477] ? f2fs_commit_super+0x1b0/0x1b0
[ 548.623481] ? set_blocksize+0x90/0x140
[ 548.623486] mount_bdev+0x1c5/0x210
[ 548.623489] ? f2fs_commit_super+0x1b0/0x1b0
[ 548.623495] f2fs_mount+0x15/0x20
[ 548.623498] mount_fs+0x60/0x1a0
[ 548.623503] ? alloc_vfsmnt+0x309/0x360
[ 548.623508] vfs_kern_mount+0x6b/0x1a0
[ 548.623513] do_mount+0x34a/0x18c0
[ 548.623518] ? lockref_put_or_lock+0xcf/0x160
[ 548.623523] ? copy_mount_string+0x20/0x20
[ 548.623528] ? memcg_kmem_put_cache+0x1b/0xa0
[ 548.623533] ? kasan_check_write+0x14/0x20
[ 548.623537] ? _copy_from_user+0x6a/0x90
[ 548.623542] ? memdup_user+0x42/0x60
[ 548.623547] ksys_mount+0x83/0xd0
[ 548.623552] __x64_sys_mount+0x67/0x80
[ 548.623557] do_syscall_64+0x78/0x170
[ 548.623562] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 548.623566] RIP: 0033:0x7f76fc331b9a
[ 548.623567] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 548.623632] RSP: 002b:00007ffd4f0a0e48 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 548.623636] RAX: ffffffffffffffda RBX: 000000000146c030 RCX: 00007f76fc331b9a
[ 548.623639] RDX: 000000000146c210 RSI: 000000000146df30 RDI: 0000000001474ec0
[ 548.623641] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 548.623643] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000000001474ec0
[ 548.623646] R13: 000000000146c210 R14: 0000000000000000 R15: 0000000000000003
[ 548.623650] ---[ end trace 4ce02f25ff7d3df5 ]---
[ 548.623656] F2FS-fs (loop0): Failed to initialize F2FS node manager
[ 548.627936] F2FS-fs (loop0): Invalid log blocks per segment (8201)
[ 548.627940] F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
[ 548.635835] F2FS-fs (loop0): Failed to initialize F2FS node manager
- Location
https://elixir.bootlin.com/linux/v4.18-rc1/source/fs/f2fs/segment.c#L3578
sit_i->sit_bitmap = kmemdup(src_bitmap, bitmap_size, GFP_KERNEL);
A buffer overrun happens during the memcpy. I suspect there are missing (or inconsistent) checks on bitmap_size.
Reported by Wen Xu (wen.xu@gatech.edu) from SSLab, Gatech.
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
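As a rough illustration of the check this patch introduces, the expected bitmap size can be derived from the segment count and compared with the value stored in the checkpoint. The sketch below mirrors the arithmetic used in the mount-time check, `((segs / 2) << log_blocks_per_seg) / 8`; the helper names `expected_bitmap_bytes()` and `bitmap_size_ok()` are hypothetical, and the reading that the division by two accounts for the area holding two copies is an assumption, not something stated in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Expected version-bitmap size in bytes: one bit per block over half of
 * the area's segments (assumed here to be because the area keeps two
 * copies). Arithmetic is 32-bit, as in the check it illustrates.
 */
static uint32_t expected_bitmap_bytes(uint32_t segs, uint32_t log_blocks_per_seg)
{
	/* Shifting a 32-bit value by 32 or more is undefined in C. */
	assert(log_blocks_per_seg < 32);

	return ((segs / 2) << log_blocks_per_seg) / 8;
}

/* Return true if the size stored on disk matches the derived size. */
static bool bitmap_size_ok(uint32_t stored_bytes, uint32_t segs,
			   uint32_t log_blocks_per_seg)
{
	return stored_bytes == expected_bitmap_bytes(segs, log_blocks_per_seg);
}
```

For example, with 4 segments and 512 blocks per segment (log_blocks_per_seg of 9), the derived size is 128 bytes; any other stored value would be rejected before it could reach kmemdup().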
2018-06-23 11:25:19 +08:00
sit_bitmap_size = le32_to_cpu(ckpt->sit_ver_bitmap_bytesize);
nat_bitmap_size = le32_to_cpu(ckpt->nat_ver_bitmap_bytesize);
if (sit_bitmap_size != ((sit_segs / 2) << log_blocks_per_seg) / 8 ||
nat_bitmap_size != ((nat_segs / 2) << log_blocks_per_seg) / 8) {
2019-06-18 17:48:42 +08:00
f2fs_err(sbi, "Wrong bitmap size: sit: %u, nat:%u",
sit_bitmap_size, nat_bitmap_size);
2018-06-23 11:25:19 +08:00
return 1;
}
f2fs: fix to do sanity check with cp_pack_start_sum
After fuzzing, cp_pack_start_sum can be corrupted, so the current log's
summary info will be wrong because an incorrect summary block is loaded.
Then, if a segment's type in the current log exceeds NR_CURSEG_TYPE, this
can eventually lead to accessing an invalid dirty_i->dirty_segmap bitmap.
Add a sanity check for cp_pack_start_sum to fix this issue.
https://bugzilla.kernel.org/show_bug.cgi?id=200419
- Reproduce
- Kernel message (f2fs-dev w/ KASAN)
[ 3117.578432] F2FS-fs (loop0): Invalid log blocks per segment (8)
[ 3117.578445] F2FS-fs (loop0): Can't find valid F2FS filesystem in 2th superblock
[ 3117.581364] F2FS-fs (loop0): invalid crc_offset: 30716
[ 3117.583564] WARNING: CPU: 1 PID: 1225 at fs/f2fs/checkpoint.c:90 __get_meta_page+0x448/0x4b0
[ 3117.583570] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer joydev input_leds serio_raw snd soundcore mac_hid i2c_piix4 ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 8139too qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel psmouse aes_x86_64 8139cp crypto_simd cryptd mii glue_helper pata_acpi floppy
[ 3117.584014] CPU: 1 PID: 1225 Comm: mount Not tainted 4.17.0+ #1
[ 3117.584017] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.584022] RIP: 0010:__get_meta_page+0x448/0x4b0
[ 3117.584023] Code: 00 49 8d bc 24 84 00 00 00 e8 74 54 da ff 41 83 8c 24 84 00 00 00 08 4c 89 f6 4c 89 ef e8 c0 d9 95 00 48 89 ef e8 18 e3 00 00 <0f> 0b f0 80 4d 48 04 e9 0f fe ff ff 0f 0b 48 89 c7 48 89 04 24 e8
[ 3117.584072] RSP: 0018:ffff88018eb678c0 EFLAGS: 00010286
[ 3117.584082] RAX: ffff88018f0a6a78 RBX: ffffea0007a46600 RCX: ffffffff9314d1b2
[ 3117.584085] RDX: ffffffff00000001 RSI: 0000000000000000 RDI: ffff88018f0a6a98
[ 3117.584087] RBP: ffff88018ebe9980 R08: 0000000000000002 R09: 0000000000000001
[ 3117.584090] R10: 0000000000000001 R11: ffffed00326e4450 R12: ffff880193722200
[ 3117.584092] R13: ffff88018ebe9afc R14: 0000000000000206 R15: ffff88018eb67900
[ 3117.584096] FS: 00007f5694636840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 3117.584098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3117.584101] CR2: 00000000016f21b8 CR3: 0000000191c22000 CR4: 00000000000006e0
[ 3117.584112] Call Trace:
[ 3117.584121] ? f2fs_set_meta_page_dirty+0x150/0x150
[ 3117.584127] ? f2fs_build_segment_manager+0xbf9/0x3190
[ 3117.584133] ? f2fs_npages_for_summary_flush+0x75/0x120
[ 3117.584145] f2fs_build_segment_manager+0xda8/0x3190
[ 3117.584151] ? f2fs_get_valid_checkpoint+0x298/0xa00
[ 3117.584156] ? f2fs_flush_sit_entries+0x10e0/0x10e0
[ 3117.584184] ? map_id_range_down+0x17c/0x1b0
[ 3117.584188] ? __put_user_ns+0x30/0x30
[ 3117.584206] ? find_next_bit+0x53/0x90
[ 3117.584237] ? cpumask_next+0x16/0x20
[ 3117.584249] f2fs_fill_super+0x1948/0x2b40
[ 3117.584258] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.584279] ? sget_userns+0x65e/0x690
[ 3117.584296] ? set_blocksize+0x88/0x130
[ 3117.584302] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.584305] mount_bdev+0x1c0/0x200
[ 3117.584310] mount_fs+0x5c/0x190
[ 3117.584320] vfs_kern_mount+0x64/0x190
[ 3117.584330] do_mount+0x2e4/0x1450
[ 3117.584343] ? lockref_put_return+0x130/0x130
[ 3117.584347] ? copy_mount_string+0x20/0x20
[ 3117.584357] ? kasan_unpoison_shadow+0x31/0x40
[ 3117.584362] ? kasan_kmalloc+0xa6/0xd0
[ 3117.584373] ? memcg_kmem_put_cache+0x16/0x90
[ 3117.584377] ? __kmalloc_track_caller+0x196/0x210
[ 3117.584383] ? _copy_from_user+0x61/0x90
[ 3117.584396] ? memdup_user+0x3e/0x60
[ 3117.584401] ksys_mount+0x7e/0xd0
[ 3117.584405] __x64_sys_mount+0x62/0x70
[ 3117.584427] do_syscall_64+0x73/0x160
[ 3117.584440] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.584455] RIP: 0033:0x7f5693f14b9a
[ 3117.584456] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.584505] RSP: 002b:00007fff27346488 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 3117.584510] RAX: ffffffffffffffda RBX: 00000000016e2030 RCX: 00007f5693f14b9a
[ 3117.584512] RDX: 00000000016e2210 RSI: 00000000016e3f30 RDI: 00000000016ee040
[ 3117.584514] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 3117.584516] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 00000000016ee040
[ 3117.584519] R13: 00000000016e2210 R14: 0000000000000000 R15: 0000000000000003
[ 3117.584523] ---[ end trace a8e0d899985faf31 ]---
[ 3117.685663] F2FS-fs (loop0): f2fs_check_nid_range: out-of-range nid=2, run fsck to fix.
[ 3117.685673] F2FS-fs (loop0): recover_data: ino = 2 (i_size: recover) recovered = 1, err = 0
[ 3117.685707] ==================================================================
[ 3117.685955] BUG: KASAN: slab-out-of-bounds in __remove_dirty_segment+0xdd/0x1e0
[ 3117.686175] Read of size 8 at addr ffff88018f0a63d0 by task mount/1225
[ 3117.686477] CPU: 0 PID: 1225 Comm: mount Tainted: G W 4.17.0+ #1
[ 3117.686481] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.686483] Call Trace:
[ 3117.686494] dump_stack+0x71/0xab
[ 3117.686512] print_address_description+0x6b/0x290
[ 3117.686517] kasan_report+0x28e/0x390
[ 3117.686522] ? __remove_dirty_segment+0xdd/0x1e0
[ 3117.686527] __remove_dirty_segment+0xdd/0x1e0
[ 3117.686532] locate_dirty_segment+0x189/0x190
[ 3117.686538] f2fs_allocate_new_segments+0xa9/0xe0
[ 3117.686543] recover_data+0x703/0x2c20
[ 3117.686547] ? f2fs_recover_fsync_data+0x48f/0xd50
[ 3117.686553] ? ksys_mount+0x7e/0xd0
[ 3117.686564] ? policy_nodemask+0x1a/0x90
[ 3117.686567] ? policy_node+0x56/0x70
[ 3117.686571] ? add_fsync_inode+0xf0/0xf0
[ 3117.686592] ? blk_finish_plug+0x44/0x60
[ 3117.686597] ? f2fs_ra_meta_pages+0x38b/0x5e0
[ 3117.686602] ? find_inode_fast+0xac/0xc0
[ 3117.686606] ? f2fs_is_valid_blkaddr+0x320/0x320
[ 3117.686618] ? __radix_tree_lookup+0x150/0x150
[ 3117.686633] ? dqget+0x670/0x670
[ 3117.686648] ? pagecache_get_page+0x29/0x410
[ 3117.686656] ? kmem_cache_alloc+0x176/0x1e0
[ 3117.686660] ? f2fs_is_valid_blkaddr+0x11d/0x320
[ 3117.686664] f2fs_recover_fsync_data+0xc23/0xd50
[ 3117.686670] ? f2fs_space_for_roll_forward+0x60/0x60
[ 3117.686674] ? rb_insert_color+0x323/0x3d0
[ 3117.686678] ? f2fs_recover_orphan_inodes+0xa5/0x700
[ 3117.686683] ? proc_register+0x153/0x1d0
[ 3117.686686] ? f2fs_remove_orphan_inode+0x10/0x10
[ 3117.686695] ? f2fs_attr_store+0x50/0x50
[ 3117.686700] ? proc_create_single_data+0x52/0x60
[ 3117.686707] f2fs_fill_super+0x1d06/0x2b40
[ 3117.686728] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.686735] ? sget_userns+0x65e/0x690
[ 3117.686740] ? set_blocksize+0x88/0x130
[ 3117.686745] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.686748] mount_bdev+0x1c0/0x200
[ 3117.686753] mount_fs+0x5c/0x190
[ 3117.686758] vfs_kern_mount+0x64/0x190
[ 3117.686762] do_mount+0x2e4/0x1450
[ 3117.686769] ? lockref_put_return+0x130/0x130
[ 3117.686773] ? copy_mount_string+0x20/0x20
[ 3117.686777] ? kasan_unpoison_shadow+0x31/0x40
[ 3117.686780] ? kasan_kmalloc+0xa6/0xd0
[ 3117.686786] ? memcg_kmem_put_cache+0x16/0x90
[ 3117.686790] ? __kmalloc_track_caller+0x196/0x210
[ 3117.686795] ? _copy_from_user+0x61/0x90
[ 3117.686801] ? memdup_user+0x3e/0x60
[ 3117.686804] ksys_mount+0x7e/0xd0
[ 3117.686809] __x64_sys_mount+0x62/0x70
[ 3117.686816] do_syscall_64+0x73/0x160
[ 3117.686824] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.686829] RIP: 0033:0x7f5693f14b9a
[ 3117.686830] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.686887] RSP: 002b:00007fff27346488 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 3117.686892] RAX: ffffffffffffffda RBX: 00000000016e2030 RCX: 00007f5693f14b9a
[ 3117.686894] RDX: 00000000016e2210 RSI: 00000000016e3f30 RDI: 00000000016ee040
[ 3117.686896] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 3117.686899] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 00000000016ee040
[ 3117.686901] R13: 00000000016e2210 R14: 0000000000000000 R15: 0000000000000003
[ 3117.687005] Allocated by task 1225:
[ 3117.687152] kasan_kmalloc+0xa6/0xd0
[ 3117.687157] kmem_cache_alloc_trace+0xfd/0x200
[ 3117.687161] f2fs_build_segment_manager+0x2d09/0x3190
[ 3117.687165] f2fs_fill_super+0x1948/0x2b40
[ 3117.687168] mount_bdev+0x1c0/0x200
[ 3117.687171] mount_fs+0x5c/0x190
[ 3117.687174] vfs_kern_mount+0x64/0x190
[ 3117.687177] do_mount+0x2e4/0x1450
[ 3117.687180] ksys_mount+0x7e/0xd0
[ 3117.687182] __x64_sys_mount+0x62/0x70
[ 3117.687186] do_syscall_64+0x73/0x160
[ 3117.687190] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.687285] Freed by task 19:
[ 3117.687412] __kasan_slab_free+0x137/0x190
[ 3117.687416] kfree+0x8b/0x1b0
[ 3117.687460] ttm_bo_man_put_node+0x61/0x80 [ttm]
[ 3117.687476] ttm_bo_cleanup_refs+0x15f/0x250 [ttm]
[ 3117.687492] ttm_bo_delayed_delete+0x2f0/0x300 [ttm]
[ 3117.687507] ttm_bo_delayed_workqueue+0x17/0x50 [ttm]
[ 3117.687528] process_one_work+0x2f9/0x740
[ 3117.687531] worker_thread+0x78/0x6b0
[ 3117.687541] kthread+0x177/0x1c0
[ 3117.687545] ret_from_fork+0x35/0x40
[ 3117.687638] The buggy address belongs to the object at ffff88018f0a6300
which belongs to the cache kmalloc-192 of size 192
[ 3117.688014] The buggy address is located 16 bytes to the right of
192-byte region [ffff88018f0a6300, ffff88018f0a63c0)
[ 3117.688382] The buggy address belongs to the page:
[ 3117.688554] page:ffffea00063c2980 count:1 mapcount:0 mapping:ffff8801f3403180 index:0x0
[ 3117.688788] flags: 0x17fff8000000100(slab)
[ 3117.688944] raw: 017fff8000000100 ffffea00063c2840 0000000e0000000e ffff8801f3403180
[ 3117.689166] raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
[ 3117.689386] page dumped because: kasan: bad access detected
[ 3117.689653] Memory state around the buggy address:
[ 3117.689816] ffff88018f0a6280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 3117.690027] ffff88018f0a6300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 3117.690239] >ffff88018f0a6380: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3117.690448] ^
[ 3117.690644] ffff88018f0a6400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 3117.690868] ffff88018f0a6480: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3117.691077] ==================================================================
[ 3117.691290] Disabling lock debugging due to kernel taint
[ 3117.693893] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 3117.694120] PGD 80000001f01bc067 P4D 80000001f01bc067 PUD 1d9638067 PMD 0
[ 3117.694338] Oops: 0002 [#1] SMP KASAN PTI
[ 3117.694490] CPU: 1 PID: 1225 Comm: mount Tainted: G B W 4.17.0+ #1
[ 3117.694703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.695073] RIP: 0010:__remove_dirty_segment+0xe2/0x1e0
[ 3117.695246] Code: c4 48 89 c7 e8 cf bb d7 ff 45 0f b6 24 24 41 83 e4 3f 44 88 64 24 07 41 83 e4 3f 4a 8d 7c e3 08 e8 b3 bc d7 ff 4a 8b 4c e3 08 <f0> 4c 0f b3 29 0f 82 94 00 00 00 48 8d bd 20 04 00 00 e8 97 bb d7
[ 3117.695793] RSP: 0018:ffff88018eb67638 EFLAGS: 00010292
[ 3117.695969] RAX: 0000000000000000 RBX: ffff88018f0a6300 RCX: 0000000000000000
[ 3117.696182] RDX: 0000000000000000 RSI: 0000000000000297 RDI: 0000000000000297
[ 3117.696391] RBP: ffff88018ebe9980 R08: ffffed003e743ebb R09: ffffed003e743ebb
[ 3117.696604] R10: 0000000000000001 R11: ffffed003e743eba R12: 0000000000000019
[ 3117.696813] R13: 0000000000000014 R14: 0000000000000320 R15: ffff88018ebe99e0
[ 3117.697032] FS: 00007f5694636840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 3117.697280] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3117.702357] CR2: 00007fe89bb1a000 CR3: 0000000191c22000 CR4: 00000000000006e0
[ 3117.707235] Call Trace:
[ 3117.712077] locate_dirty_segment+0x189/0x190
[ 3117.716891] f2fs_allocate_new_segments+0xa9/0xe0
[ 3117.721617] recover_data+0x703/0x2c20
[ 3117.726316] ? f2fs_recover_fsync_data+0x48f/0xd50
[ 3117.730957] ? ksys_mount+0x7e/0xd0
[ 3117.735573] ? policy_nodemask+0x1a/0x90
[ 3117.740198] ? policy_node+0x56/0x70
[ 3117.744829] ? add_fsync_inode+0xf0/0xf0
[ 3117.749487] ? blk_finish_plug+0x44/0x60
[ 3117.754152] ? f2fs_ra_meta_pages+0x38b/0x5e0
[ 3117.758831] ? find_inode_fast+0xac/0xc0
[ 3117.763448] ? f2fs_is_valid_blkaddr+0x320/0x320
[ 3117.768046] ? __radix_tree_lookup+0x150/0x150
[ 3117.772603] ? dqget+0x670/0x670
[ 3117.777159] ? pagecache_get_page+0x29/0x410
[ 3117.781648] ? kmem_cache_alloc+0x176/0x1e0
[ 3117.786067] ? f2fs_is_valid_blkaddr+0x11d/0x320
[ 3117.790476] f2fs_recover_fsync_data+0xc23/0xd50
[ 3117.794790] ? f2fs_space_for_roll_forward+0x60/0x60
[ 3117.799086] ? rb_insert_color+0x323/0x3d0
[ 3117.803304] ? f2fs_recover_orphan_inodes+0xa5/0x700
[ 3117.807563] ? proc_register+0x153/0x1d0
[ 3117.811766] ? f2fs_remove_orphan_inode+0x10/0x10
[ 3117.815947] ? f2fs_attr_store+0x50/0x50
[ 3117.820087] ? proc_create_single_data+0x52/0x60
[ 3117.824262] f2fs_fill_super+0x1d06/0x2b40
[ 3117.828367] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.832432] ? sget_userns+0x65e/0x690
[ 3117.836500] ? set_blocksize+0x88/0x130
[ 3117.840501] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.844420] mount_bdev+0x1c0/0x200
[ 3117.848275] mount_fs+0x5c/0x190
[ 3117.852053] vfs_kern_mount+0x64/0x190
[ 3117.855810] do_mount+0x2e4/0x1450
[ 3117.859441] ? lockref_put_return+0x130/0x130
[ 3117.862996] ? copy_mount_string+0x20/0x20
[ 3117.866417] ? kasan_unpoison_shadow+0x31/0x40
[ 3117.869719] ? kasan_kmalloc+0xa6/0xd0
[ 3117.872948] ? memcg_kmem_put_cache+0x16/0x90
[ 3117.876121] ? __kmalloc_track_caller+0x196/0x210
[ 3117.879333] ? _copy_from_user+0x61/0x90
[ 3117.882467] ? memdup_user+0x3e/0x60
[ 3117.885604] ksys_mount+0x7e/0xd0
[ 3117.888700] __x64_sys_mount+0x62/0x70
[ 3117.891742] do_syscall_64+0x73/0x160
[ 3117.894692] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.897669] RIP: 0033:0x7f5693f14b9a
[ 3117.900563] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.906922] RSP: 002b:00007fff27346488 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 3117.910159] RAX: ffffffffffffffda RBX: 00000000016e2030 RCX: 00007f5693f14b9a
[ 3117.913469] RDX: 00000000016e2210 RSI: 00000000016e3f30 RDI: 00000000016ee040
[ 3117.916764] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 3117.920071] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 00000000016ee040
[ 3117.923393] R13: 00000000016e2210 R14: 0000000000000000 R15: 0000000000000003
[ 3117.926680] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer joydev input_leds serio_raw snd soundcore mac_hid i2c_piix4 ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 8139too qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel psmouse aes_x86_64 8139cp crypto_simd cryptd mii glue_helper pata_acpi floppy
[ 3117.949979] CR2: 0000000000000000
[ 3117.954283] ---[ end trace a8e0d899985faf32 ]---
[ 3117.958575] RIP: 0010:__remove_dirty_segment+0xe2/0x1e0
[ 3117.962810] Code: c4 48 89 c7 e8 cf bb d7 ff 45 0f b6 24 24 41 83 e4 3f 44 88 64 24 07 41 83 e4 3f 4a 8d 7c e3 08 e8 b3 bc d7 ff 4a 8b 4c e3 08 <f0> 4c 0f b3 29 0f 82 94 00 00 00 48 8d bd 20 04 00 00 e8 97 bb d7
[ 3117.971789] RSP: 0018:ffff88018eb67638 EFLAGS: 00010292
[ 3117.976333] RAX: 0000000000000000 RBX: ffff88018f0a6300 RCX: 0000000000000000
[ 3117.980926] RDX: 0000000000000000 RSI: 0000000000000297 RDI: 0000000000000297
[ 3117.985497] RBP: ffff88018ebe9980 R08: ffffed003e743ebb R09: ffffed003e743ebb
[ 3117.990098] R10: 0000000000000001 R11: ffffed003e743eba R12: 0000000000000019
[ 3117.994761] R13: 0000000000000014 R14: 0000000000000320 R15: ffff88018ebe99e0
[ 3117.999392] FS: 00007f5694636840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 3118.004096] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3118.008816] CR2: 00007fe89bb1a000 CR3: 0000000191c22000 CR4: 00000000000006e0
- Location
https://elixir.bootlin.com/linux/v4.18-rc3/source/fs/f2fs/segment.c#L775
if (test_and_clear_bit(segno, dirty_i->dirty_segmap[t]))
dirty_i->nr_dirty[t]--;
Here dirty_i->dirty_segmap[t] can be NULL, which leads to a crash in test_and_clear_bit().
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
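The range check this commit describes can be sketched as a standalone predicate. This is an illustration under stated assumptions, not the kernel code: cp_pack_start_sum_is_valid is a hypothetical name, the three values are passed in directly instead of being read from the superblock info, and NR_CURSEG_PERSIST_TYPE is assumed to be 6. The extra guard against unsigned wrap-around is an addition of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: current-segment types persisted in a checkpoint pack. */
#define NR_CURSEG_PERSIST_TYPE 6u

/*
 * Return true when the summary start offset lies inside the range the
 * sanity check accepts:
 *   cp_payload + 1 <= start <= blocks_per_seg - 1 - NR_CURSEG_PERSIST_TYPE
 */
static bool cp_pack_start_sum_is_valid(uint32_t cp_pack_start_sum,
                                       uint32_t cp_payload,
                                       uint32_t blocks_per_seg)
{
    /* Guard the subtraction below against unsigned wrap-around. */
    if (blocks_per_seg < 1u + NR_CURSEG_PERSIST_TYPE)
        return false;

    /* Lower bound; widened to 64 bits so cp_payload + 1 cannot wrap. */
    if ((uint64_t)cp_pack_start_sum < (uint64_t)cp_payload + 1u)
        return false;

    /* Upper bound: leave room for the persisted summary blocks. */
    if (cp_pack_start_sum > blocks_per_seg - 1u - NR_CURSEG_PERSIST_TYPE)
        return false;

    return true;
}

/* Self-check using 512 blocks per segment as an example geometry. */
static void cp_pack_start_sum_selftest(void)
{
    assert(cp_pack_start_sum_is_valid(1, 0, 512));
    assert(!cp_pack_start_sum_is_valid(0, 0, 512));
    assert(cp_pack_start_sum_is_valid(505, 0, 512));
    assert(!cp_pack_start_sum_is_valid(506, 0, 512));
}
```

Rejecting the value at mount time means a corrupted offset never reaches the summary loader, so a bogus segment type cannot later be used to index dirty_i->dirty_segmap.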
2018-08-01 19:16:11 +08:00
cp_pack_start_sum = __start_sum_addr ( sbi ) ;
cp_payload = __cp_payload ( sbi ) ;
if ( cp_pack_start_sum < cp_payload + 1 | |
cp_pack_start_sum > blocks_per_seg - 1 -
f2fs: introduce inmem curseg
The previous implementation of aligned pinfile allocation will:
- allocate a new segment on the cold data log regardless of whether the
last used segment is partially used or not, which makes IOs more random;
- force concurrent cold data/GCed IO into the warm data area, which
can hurt hot/cold data separation;
In this patch, we introduce a new type of log named 'inmem curseg',
which differs from a normal curseg as follows:
- it reuses an existing segment type (CURSEG_XXX_NODE/DATA);
- it only exists in memory; its segno, blkofs and summary will not be
persisted into the checkpoint area;
With this new feature, we can enhance the scalability of the log; special
allocators can be created for purposes such as:
- a pure lfs allocator for aligned pinfile allocation or file
defragmentation
- a pure ssr allocator for a later feature
So let's update aligned pinfile allocation to use this new
inmem curseg framework.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-08-04 21:14:45 +08:00
NR_CURSEG_PERSIST_TYPE ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " Wrong cp_pack_start_sum: %u " ,
cp_pack_start_sum ) ;
f2fs: fix to do sanity check with cp_pack_start_sum
After fuzzing, cp_pack_start_sum can be corrupted, so the current log's
summary info will be wrong because an incorrect summary block is loaded.
Then, if a segment's type in the current log exceeds NR_CURSEG_TYPE, this
can eventually lead to accessing an invalid dirty_i->dirty_segmap bitmap.
Add a sanity check for cp_pack_start_sum to fix this issue.
https://bugzilla.kernel.org/show_bug.cgi?id=200419
- Reproduce
- Kernel message (f2fs-dev w/ KASAN)
[ 3117.578432] F2FS-fs (loop0): Invalid log blocks per segment (8)
[ 3117.578445] F2FS-fs (loop0): Can't find valid F2FS filesystem in 2th superblock
[ 3117.581364] F2FS-fs (loop0): invalid crc_offset: 30716
[ 3117.583564] WARNING: CPU: 1 PID: 1225 at fs/f2fs/checkpoint.c:90 __get_meta_page+0x448/0x4b0
[ 3117.583570] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer joydev input_leds serio_raw snd soundcore mac_hid i2c_piix4 ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 8139too qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel psmouse aes_x86_64 8139cp crypto_simd cryptd mii glue_helper pata_acpi floppy
[ 3117.584014] CPU: 1 PID: 1225 Comm: mount Not tainted 4.17.0+ #1
[ 3117.584017] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.584022] RIP: 0010:__get_meta_page+0x448/0x4b0
[ 3117.584023] Code: 00 49 8d bc 24 84 00 00 00 e8 74 54 da ff 41 83 8c 24 84 00 00 00 08 4c 89 f6 4c 89 ef e8 c0 d9 95 00 48 89 ef e8 18 e3 00 00 <0f> 0b f0 80 4d 48 04 e9 0f fe ff ff 0f 0b 48 89 c7 48 89 04 24 e8
[ 3117.584072] RSP: 0018:ffff88018eb678c0 EFLAGS: 00010286
[ 3117.584082] RAX: ffff88018f0a6a78 RBX: ffffea0007a46600 RCX: ffffffff9314d1b2
[ 3117.584085] RDX: ffffffff00000001 RSI: 0000000000000000 RDI: ffff88018f0a6a98
[ 3117.584087] RBP: ffff88018ebe9980 R08: 0000000000000002 R09: 0000000000000001
[ 3117.584090] R10: 0000000000000001 R11: ffffed00326e4450 R12: ffff880193722200
[ 3117.584092] R13: ffff88018ebe9afc R14: 0000000000000206 R15: ffff88018eb67900
[ 3117.584096] FS: 00007f5694636840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 3117.584098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3117.584101] CR2: 00000000016f21b8 CR3: 0000000191c22000 CR4: 00000000000006e0
[ 3117.584112] Call Trace:
[ 3117.584121] ? f2fs_set_meta_page_dirty+0x150/0x150
[ 3117.584127] ? f2fs_build_segment_manager+0xbf9/0x3190
[ 3117.584133] ? f2fs_npages_for_summary_flush+0x75/0x120
[ 3117.584145] f2fs_build_segment_manager+0xda8/0x3190
[ 3117.584151] ? f2fs_get_valid_checkpoint+0x298/0xa00
[ 3117.584156] ? f2fs_flush_sit_entries+0x10e0/0x10e0
[ 3117.584184] ? map_id_range_down+0x17c/0x1b0
[ 3117.584188] ? __put_user_ns+0x30/0x30
[ 3117.584206] ? find_next_bit+0x53/0x90
[ 3117.584237] ? cpumask_next+0x16/0x20
[ 3117.584249] f2fs_fill_super+0x1948/0x2b40
[ 3117.584258] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.584279] ? sget_userns+0x65e/0x690
[ 3117.584296] ? set_blocksize+0x88/0x130
[ 3117.584302] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.584305] mount_bdev+0x1c0/0x200
[ 3117.584310] mount_fs+0x5c/0x190
[ 3117.584320] vfs_kern_mount+0x64/0x190
[ 3117.584330] do_mount+0x2e4/0x1450
[ 3117.584343] ? lockref_put_return+0x130/0x130
[ 3117.584347] ? copy_mount_string+0x20/0x20
[ 3117.584357] ? kasan_unpoison_shadow+0x31/0x40
[ 3117.584362] ? kasan_kmalloc+0xa6/0xd0
[ 3117.584373] ? memcg_kmem_put_cache+0x16/0x90
[ 3117.584377] ? __kmalloc_track_caller+0x196/0x210
[ 3117.584383] ? _copy_from_user+0x61/0x90
[ 3117.584396] ? memdup_user+0x3e/0x60
[ 3117.584401] ksys_mount+0x7e/0xd0
[ 3117.584405] __x64_sys_mount+0x62/0x70
[ 3117.584427] do_syscall_64+0x73/0x160
[ 3117.584440] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.584455] RIP: 0033:0x7f5693f14b9a
[ 3117.584456] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.584505] RSP: 002b:00007fff27346488 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 3117.584510] RAX: ffffffffffffffda RBX: 00000000016e2030 RCX: 00007f5693f14b9a
[ 3117.584512] RDX: 00000000016e2210 RSI: 00000000016e3f30 RDI: 00000000016ee040
[ 3117.584514] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 3117.584516] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 00000000016ee040
[ 3117.584519] R13: 00000000016e2210 R14: 0000000000000000 R15: 0000000000000003
[ 3117.584523] ---[ end trace a8e0d899985faf31 ]---
[ 3117.685663] F2FS-fs (loop0): f2fs_check_nid_range: out-of-range nid=2, run fsck to fix.
[ 3117.685673] F2FS-fs (loop0): recover_data: ino = 2 (i_size: recover) recovered = 1, err = 0
[ 3117.685707] ==================================================================
[ 3117.685955] BUG: KASAN: slab-out-of-bounds in __remove_dirty_segment+0xdd/0x1e0
[ 3117.686175] Read of size 8 at addr ffff88018f0a63d0 by task mount/1225
[ 3117.686477] CPU: 0 PID: 1225 Comm: mount Tainted: G W 4.17.0+ #1
[ 3117.686481] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.686483] Call Trace:
[ 3117.686494] dump_stack+0x71/0xab
[ 3117.686512] print_address_description+0x6b/0x290
[ 3117.686517] kasan_report+0x28e/0x390
[ 3117.686522] ? __remove_dirty_segment+0xdd/0x1e0
[ 3117.686527] __remove_dirty_segment+0xdd/0x1e0
[ 3117.686532] locate_dirty_segment+0x189/0x190
[ 3117.686538] f2fs_allocate_new_segments+0xa9/0xe0
[ 3117.686543] recover_data+0x703/0x2c20
[ 3117.686547] ? f2fs_recover_fsync_data+0x48f/0xd50
[ 3117.686553] ? ksys_mount+0x7e/0xd0
[ 3117.686564] ? policy_nodemask+0x1a/0x90
[ 3117.686567] ? policy_node+0x56/0x70
[ 3117.686571] ? add_fsync_inode+0xf0/0xf0
[ 3117.686592] ? blk_finish_plug+0x44/0x60
[ 3117.686597] ? f2fs_ra_meta_pages+0x38b/0x5e0
[ 3117.686602] ? find_inode_fast+0xac/0xc0
[ 3117.686606] ? f2fs_is_valid_blkaddr+0x320/0x320
[ 3117.686618] ? __radix_tree_lookup+0x150/0x150
[ 3117.686633] ? dqget+0x670/0x670
[ 3117.686648] ? pagecache_get_page+0x29/0x410
[ 3117.686656] ? kmem_cache_alloc+0x176/0x1e0
[ 3117.686660] ? f2fs_is_valid_blkaddr+0x11d/0x320
[ 3117.686664] f2fs_recover_fsync_data+0xc23/0xd50
[ 3117.686670] ? f2fs_space_for_roll_forward+0x60/0x60
[ 3117.686674] ? rb_insert_color+0x323/0x3d0
[ 3117.686678] ? f2fs_recover_orphan_inodes+0xa5/0x700
[ 3117.686683] ? proc_register+0x153/0x1d0
[ 3117.686686] ? f2fs_remove_orphan_inode+0x10/0x10
[ 3117.686695] ? f2fs_attr_store+0x50/0x50
[ 3117.686700] ? proc_create_single_data+0x52/0x60
[ 3117.686707] f2fs_fill_super+0x1d06/0x2b40
[ 3117.686728] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.686735] ? sget_userns+0x65e/0x690
[ 3117.686740] ? set_blocksize+0x88/0x130
[ 3117.686745] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.686748] mount_bdev+0x1c0/0x200
[ 3117.686753] mount_fs+0x5c/0x190
[ 3117.686758] vfs_kern_mount+0x64/0x190
[ 3117.686762] do_mount+0x2e4/0x1450
[ 3117.686769] ? lockref_put_return+0x130/0x130
[ 3117.686773] ? copy_mount_string+0x20/0x20
[ 3117.686777] ? kasan_unpoison_shadow+0x31/0x40
[ 3117.686780] ? kasan_kmalloc+0xa6/0xd0
[ 3117.686786] ? memcg_kmem_put_cache+0x16/0x90
[ 3117.686790] ? __kmalloc_track_caller+0x196/0x210
[ 3117.686795] ? _copy_from_user+0x61/0x90
[ 3117.686801] ? memdup_user+0x3e/0x60
[ 3117.686804] ksys_mount+0x7e/0xd0
[ 3117.686809] __x64_sys_mount+0x62/0x70
[ 3117.686816] do_syscall_64+0x73/0x160
[ 3117.686824] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.686829] RIP: 0033:0x7f5693f14b9a
[ 3117.686830] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.686887] RSP: 002b:00007fff27346488 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 3117.686892] RAX: ffffffffffffffda RBX: 00000000016e2030 RCX: 00007f5693f14b9a
[ 3117.686894] RDX: 00000000016e2210 RSI: 00000000016e3f30 RDI: 00000000016ee040
[ 3117.686896] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 3117.686899] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 00000000016ee040
[ 3117.686901] R13: 00000000016e2210 R14: 0000000000000000 R15: 0000000000000003
[ 3117.687005] Allocated by task 1225:
[ 3117.687152] kasan_kmalloc+0xa6/0xd0
[ 3117.687157] kmem_cache_alloc_trace+0xfd/0x200
[ 3117.687161] f2fs_build_segment_manager+0x2d09/0x3190
[ 3117.687165] f2fs_fill_super+0x1948/0x2b40
[ 3117.687168] mount_bdev+0x1c0/0x200
[ 3117.687171] mount_fs+0x5c/0x190
[ 3117.687174] vfs_kern_mount+0x64/0x190
[ 3117.687177] do_mount+0x2e4/0x1450
[ 3117.687180] ksys_mount+0x7e/0xd0
[ 3117.687182] __x64_sys_mount+0x62/0x70
[ 3117.687186] do_syscall_64+0x73/0x160
[ 3117.687190] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.687285] Freed by task 19:
[ 3117.687412] __kasan_slab_free+0x137/0x190
[ 3117.687416] kfree+0x8b/0x1b0
[ 3117.687460] ttm_bo_man_put_node+0x61/0x80 [ttm]
[ 3117.687476] ttm_bo_cleanup_refs+0x15f/0x250 [ttm]
[ 3117.687492] ttm_bo_delayed_delete+0x2f0/0x300 [ttm]
[ 3117.687507] ttm_bo_delayed_workqueue+0x17/0x50 [ttm]
[ 3117.687528] process_one_work+0x2f9/0x740
[ 3117.687531] worker_thread+0x78/0x6b0
[ 3117.687541] kthread+0x177/0x1c0
[ 3117.687545] ret_from_fork+0x35/0x40
[ 3117.687638] The buggy address belongs to the object at ffff88018f0a6300
which belongs to the cache kmalloc-192 of size 192
[ 3117.688014] The buggy address is located 16 bytes to the right of
192-byte region [ffff88018f0a6300, ffff88018f0a63c0)
[ 3117.688382] The buggy address belongs to the page:
[ 3117.688554] page:ffffea00063c2980 count:1 mapcount:0 mapping:ffff8801f3403180 index:0x0
[ 3117.688788] flags: 0x17fff8000000100(slab)
[ 3117.688944] raw: 017fff8000000100 ffffea00063c2840 0000000e0000000e ffff8801f3403180
[ 3117.689166] raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
[ 3117.689386] page dumped because: kasan: bad access detected
[ 3117.689653] Memory state around the buggy address:
[ 3117.689816] ffff88018f0a6280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 3117.690027] ffff88018f0a6300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 3117.690239] >ffff88018f0a6380: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3117.690448] ^
[ 3117.690644] ffff88018f0a6400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 3117.690868] ffff88018f0a6480: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3117.691077] ==================================================================
[ 3117.691290] Disabling lock debugging due to kernel taint
[ 3117.693893] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 3117.694120] PGD 80000001f01bc067 P4D 80000001f01bc067 PUD 1d9638067 PMD 0
[ 3117.694338] Oops: 0002 [#1] SMP KASAN PTI
[ 3117.694490] CPU: 1 PID: 1225 Comm: mount Tainted: G B W 4.17.0+ #1
[ 3117.694703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.695073] RIP: 0010:__remove_dirty_segment+0xe2/0x1e0
[ 3117.695246] Code: c4 48 89 c7 e8 cf bb d7 ff 45 0f b6 24 24 41 83 e4 3f 44 88 64 24 07 41 83 e4 3f 4a 8d 7c e3 08 e8 b3 bc d7 ff 4a 8b 4c e3 08 <f0> 4c 0f b3 29 0f 82 94 00 00 00 48 8d bd 20 04 00 00 e8 97 bb d7
[ 3117.695793] RSP: 0018:ffff88018eb67638 EFLAGS: 00010292
[ 3117.695969] RAX: 0000000000000000 RBX: ffff88018f0a6300 RCX: 0000000000000000
[ 3117.696182] RDX: 0000000000000000 RSI: 0000000000000297 RDI: 0000000000000297
[ 3117.696391] RBP: ffff88018ebe9980 R08: ffffed003e743ebb R09: ffffed003e743ebb
[ 3117.696604] R10: 0000000000000001 R11: ffffed003e743eba R12: 0000000000000019
[ 3117.696813] R13: 0000000000000014 R14: 0000000000000320 R15: ffff88018ebe99e0
[ 3117.697032] FS: 00007f5694636840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 3117.697280] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3117.702357] CR2: 00007fe89bb1a000 CR3: 0000000191c22000 CR4: 00000000000006e0
[ 3117.707235] Call Trace:
[ 3117.712077] locate_dirty_segment+0x189/0x190
[ 3117.716891] f2fs_allocate_new_segments+0xa9/0xe0
[ 3117.721617] recover_data+0x703/0x2c20
[ 3117.726316] ? f2fs_recover_fsync_data+0x48f/0xd50
[ 3117.730957] ? ksys_mount+0x7e/0xd0
[ 3117.735573] ? policy_nodemask+0x1a/0x90
[ 3117.740198] ? policy_node+0x56/0x70
[ 3117.744829] ? add_fsync_inode+0xf0/0xf0
[ 3117.749487] ? blk_finish_plug+0x44/0x60
[ 3117.754152] ? f2fs_ra_meta_pages+0x38b/0x5e0
[ 3117.758831] ? find_inode_fast+0xac/0xc0
[ 3117.763448] ? f2fs_is_valid_blkaddr+0x320/0x320
[ 3117.768046] ? __radix_tree_lookup+0x150/0x150
[ 3117.772603] ? dqget+0x670/0x670
[ 3117.777159] ? pagecache_get_page+0x29/0x410
[ 3117.781648] ? kmem_cache_alloc+0x176/0x1e0
[ 3117.786067] ? f2fs_is_valid_blkaddr+0x11d/0x320
[ 3117.790476] f2fs_recover_fsync_data+0xc23/0xd50
[ 3117.794790] ? f2fs_space_for_roll_forward+0x60/0x60
[ 3117.799086] ? rb_insert_color+0x323/0x3d0
[ 3117.803304] ? f2fs_recover_orphan_inodes+0xa5/0x700
[ 3117.807563] ? proc_register+0x153/0x1d0
[ 3117.811766] ? f2fs_remove_orphan_inode+0x10/0x10
[ 3117.815947] ? f2fs_attr_store+0x50/0x50
[ 3117.820087] ? proc_create_single_data+0x52/0x60
[ 3117.824262] f2fs_fill_super+0x1d06/0x2b40
[ 3117.828367] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.832432] ? sget_userns+0x65e/0x690
[ 3117.836500] ? set_blocksize+0x88/0x130
[ 3117.840501] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.844420] mount_bdev+0x1c0/0x200
[ 3117.848275] mount_fs+0x5c/0x190
[ 3117.852053] vfs_kern_mount+0x64/0x190
[ 3117.855810] do_mount+0x2e4/0x1450
[ 3117.859441] ? lockref_put_return+0x130/0x130
[ 3117.862996] ? copy_mount_string+0x20/0x20
[ 3117.866417] ? kasan_unpoison_shadow+0x31/0x40
[ 3117.869719] ? kasan_kmalloc+0xa6/0xd0
[ 3117.872948] ? memcg_kmem_put_cache+0x16/0x90
[ 3117.876121] ? __kmalloc_track_caller+0x196/0x210
[ 3117.879333] ? _copy_from_user+0x61/0x90
[ 3117.882467] ? memdup_user+0x3e/0x60
[ 3117.885604] ksys_mount+0x7e/0xd0
[ 3117.888700] __x64_sys_mount+0x62/0x70
[ 3117.891742] do_syscall_64+0x73/0x160
[ 3117.894692] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.897669] RIP: 0033:0x7f5693f14b9a
[ 3117.900563] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.906922] RSP: 002b:00007fff27346488 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 3117.910159] RAX: ffffffffffffffda RBX: 00000000016e2030 RCX: 00007f5693f14b9a
[ 3117.913469] RDX: 00000000016e2210 RSI: 00000000016e3f30 RDI: 00000000016ee040
[ 3117.916764] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 3117.920071] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 00000000016ee040
[ 3117.923393] R13: 00000000016e2210 R14: 0000000000000000 R15: 0000000000000003
[ 3117.926680] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer joydev input_leds serio_raw snd soundcore mac_hid i2c_piix4 ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 8139too qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel psmouse aes_x86_64 8139cp crypto_simd cryptd mii glue_helper pata_acpi floppy
[ 3117.949979] CR2: 0000000000000000
[ 3117.954283] ---[ end trace a8e0d899985faf32 ]---
[ 3117.958575] RIP: 0010:__remove_dirty_segment+0xe2/0x1e0
[ 3117.962810] Code: c4 48 89 c7 e8 cf bb d7 ff 45 0f b6 24 24 41 83 e4 3f 44 88 64 24 07 41 83 e4 3f 4a 8d 7c e3 08 e8 b3 bc d7 ff 4a 8b 4c e3 08 <f0> 4c 0f b3 29 0f 82 94 00 00 00 48 8d bd 20 04 00 00 e8 97 bb d7
[ 3117.971789] RSP: 0018:ffff88018eb67638 EFLAGS: 00010292
[ 3117.976333] RAX: 0000000000000000 RBX: ffff88018f0a6300 RCX: 0000000000000000
[ 3117.980926] RDX: 0000000000000000 RSI: 0000000000000297 RDI: 0000000000000297
[ 3117.985497] RBP: ffff88018ebe9980 R08: ffffed003e743ebb R09: ffffed003e743ebb
[ 3117.990098] R10: 0000000000000001 R11: ffffed003e743eba R12: 0000000000000019
[ 3117.994761] R13: 0000000000000014 R14: 0000000000000320 R15: ffff88018ebe99e0
[ 3117.999392] FS: 00007f5694636840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 3118.004096] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3118.008816] CR2: 00007fe89bb1a000 CR3: 0000000191c22000 CR4: 00000000000006e0
- Location
https://elixir.bootlin.com/linux/v4.18-rc3/source/fs/f2fs/segment.c#L775
if (test_and_clear_bit(segno, dirty_i->dirty_segmap[t]))
dirty_i->nr_dirty[t]--;
Here dirty_i->dirty_segmap[t] can be NULL which leads to crash in test_and_clear_bit()
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
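The failure mode described above can be illustrated with a minimal user-space sketch. It is not the fix applied by this commit: the helper names (`sketch_test_and_clear_bit`, `sketch_remove_dirty`) are hypothetical, and the bit operation is a non-atomic stand-in for the kernel's test_and_clear_bit(). It only shows why a per-type bitmap pointer that was never allocated must not reach the bit operation.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Non-atomic user-space stand-in for the kernel's test_and_clear_bit(). */
static bool sketch_test_and_clear_bit(unsigned int nr, unsigned long *map)
{
	unsigned long mask = 1UL << (nr % SKETCH_BITS_PER_LONG);
	unsigned long *word = &map[nr / SKETCH_BITS_PER_LONG];
	bool was_set = (*word & mask) != 0;

	*word &= ~mask;
	return was_set;
}

/*
 * Hypothetical guard: refuse to touch a per-type bitmap that was never
 * allocated. Returns false when segmap[t] is NULL, true otherwise.
 */
static bool sketch_remove_dirty(unsigned long **segmap, int *nr_dirty,
				int t, unsigned int segno)
{
	assert(segmap != NULL && nr_dirty != NULL);

	if (!segmap[t])
		return false;	/* would have crashed in the bit operation */

	if (sketch_test_and_clear_bit(segno, segmap[t]))
		nr_dirty[t]--;
	return true;
}
```

A caller that receives false here would treat the on-disk segment type as corrupted instead of dereferencing the missing map.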
2018-08-01 19:16:11 +08:00
return 1 ;
}
f2fs: fix to do sanity check with {sit,nat}_ver_bitmap_bytesize
This patch adds a sanity check on {sit,nat}_ver_bitmap_bytesize
during mount, in order to avoid accessing memory beyond the cache
boundary when the bitmap size is abnormal.
- Overview
buffer overrun in build_sit_info() when mounting a crafted f2fs image
- Reproduce
- Kernel message
[ 548.580867] F2FS-fs (loop0): Invalid log blocks per segment (8201)
[ 548.580877] F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
[ 548.584979] ==================================================================
[ 548.586568] BUG: KASAN: use-after-free in kmemdup+0x36/0x50
[ 548.587715] Read of size 64 at addr ffff8801e9c265ff by task mount/1295
[ 548.589428] CPU: 1 PID: 1295 Comm: mount Not tainted 4.18.0-rc1+ #4
[ 548.589432] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 548.589438] Call Trace:
[ 548.589474] dump_stack+0x7b/0xb5
[ 548.589487] print_address_description+0x70/0x290
[ 548.589492] kasan_report+0x291/0x390
[ 548.589496] ? kmemdup+0x36/0x50
[ 548.589509] check_memory_region+0x139/0x190
[ 548.589514] memcpy+0x23/0x50
[ 548.589518] kmemdup+0x36/0x50
[ 548.589545] f2fs_build_segment_manager+0x8fa/0x3410
[ 548.589551] ? __asan_loadN+0xf/0x20
[ 548.589560] ? f2fs_sanity_check_ckpt+0x1be/0x240
[ 548.589566] ? f2fs_flush_sit_entries+0x10c0/0x10c0
[ 548.589587] ? __put_user_ns+0x40/0x40
[ 548.589604] ? find_next_bit+0x57/0x90
[ 548.589610] f2fs_fill_super+0x194b/0x2b40
[ 548.589617] ? f2fs_commit_super+0x1b0/0x1b0
[ 548.589637] ? set_blocksize+0x90/0x140
[ 548.589651] mount_bdev+0x1c5/0x210
[ 548.589655] ? f2fs_commit_super+0x1b0/0x1b0
[ 548.589667] f2fs_mount+0x15/0x20
[ 548.589672] mount_fs+0x60/0x1a0
[ 548.589683] ? alloc_vfsmnt+0x309/0x360
[ 548.589688] vfs_kern_mount+0x6b/0x1a0
[ 548.589699] do_mount+0x34a/0x18c0
[ 548.589710] ? lockref_put_or_lock+0xcf/0x160
[ 548.589716] ? copy_mount_string+0x20/0x20
[ 548.589728] ? memcg_kmem_put_cache+0x1b/0xa0
[ 548.589734] ? kasan_check_write+0x14/0x20
[ 548.589740] ? _copy_from_user+0x6a/0x90
[ 548.589744] ? memdup_user+0x42/0x60
[ 548.589750] ksys_mount+0x83/0xd0
[ 548.589755] __x64_sys_mount+0x67/0x80
[ 548.589781] do_syscall_64+0x78/0x170
[ 548.589797] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 548.589820] RIP: 0033:0x7f76fc331b9a
[ 548.589821] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 548.589880] RSP: 002b:00007ffd4f0a0e48 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 548.589890] RAX: ffffffffffffffda RBX: 000000000146c030 RCX: 00007f76fc331b9a
[ 548.589892] RDX: 000000000146c210 RSI: 000000000146df30 RDI: 0000000001474ec0
[ 548.589895] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 548.589897] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000000001474ec0
[ 548.589900] R13: 000000000146c210 R14: 0000000000000000 R15: 0000000000000003
[ 548.590242] The buggy address belongs to the page:
[ 548.591243] page:ffffea0007a70980 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[ 548.592886] flags: 0x2ffff0000000000()
[ 548.593665] raw: 02ffff0000000000 dead000000000100 dead000000000200 0000000000000000
[ 548.595258] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 548.603713] page dumped because: kasan: bad access detected
[ 548.605203] Memory state around the buggy address:
[ 548.606198] ffff8801e9c26480: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.607676] ffff8801e9c26500: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.609157] >ffff8801e9c26580: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.610629] ^
[ 548.612088] ffff8801e9c26600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.613674] ffff8801e9c26680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 548.615141] ==================================================================
[ 548.616613] Disabling lock debugging due to kernel taint
[ 548.622871] WARNING: CPU: 1 PID: 1295 at mm/page_alloc.c:4065 __alloc_pages_slowpath+0xe4a/0x1420
[ 548.622878] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[ 548.623217] CPU: 1 PID: 1295 Comm: mount Tainted: G B 4.18.0-rc1+ #4
[ 548.623219] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 548.623226] RIP: 0010:__alloc_pages_slowpath+0xe4a/0x1420
[ 548.623227] Code: ff ff 01 89 85 c8 fe ff ff e9 91 fc ff ff 41 89 c5 e9 5c fc ff ff 0f 0b 89 f8 25 ff ff f7 ff 89 85 8c fe ff ff e9 d5 f2 ff ff <0f> 0b e9 65 f2 ff ff 65 8b 05 38 81 d2 47 f6 c4 01 74 1c 65 48 8b
[ 548.623281] RSP: 0018:ffff8801f28c7678 EFLAGS: 00010246
[ 548.623284] RAX: 0000000000000000 RBX: 00000000006040c0 RCX: ffffffffb82f73b7
[ 548.623287] RDX: 1ffff1003e518eeb RSI: 000000000000000c RDI: 0000000000000000
[ 548.623290] RBP: ffff8801f28c7880 R08: 0000000000000000 R09: ffffed0047fff2c5
[ 548.623292] R10: 0000000000000001 R11: ffffed0047fff2c4 R12: ffff8801e88de040
[ 548.623295] R13: 00000000006040c0 R14: 000000000000000c R15: ffff8801f28c7938
[ 548.623299] FS: 00007f76fca51840(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[ 548.623302] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 548.623304] CR2: 00007f19b9171760 CR3: 00000001ed952000 CR4: 00000000000006e0
[ 548.623317] Call Trace:
[ 548.623325] ? kasan_check_read+0x11/0x20
[ 548.623330] ? __zone_watermark_ok+0x92/0x240
[ 548.623336] ? get_page_from_freelist+0x1c3/0x1d90
[ 548.623347] ? _raw_spin_lock_irqsave+0x2a/0x60
[ 548.623353] ? warn_alloc+0x250/0x250
[ 548.623358] ? save_stack+0x46/0xd0
[ 548.623361] ? kasan_kmalloc+0xad/0xe0
[ 548.623366] ? __isolate_free_page+0x2a0/0x2a0
[ 548.623370] ? mount_fs+0x60/0x1a0
[ 548.623374] ? vfs_kern_mount+0x6b/0x1a0
[ 548.623378] ? do_mount+0x34a/0x18c0
[ 548.623383] ? ksys_mount+0x83/0xd0
[ 548.623387] ? __x64_sys_mount+0x67/0x80
[ 548.623391] ? do_syscall_64+0x78/0x170
[ 548.623396] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 548.623401] __alloc_pages_nodemask+0x3c5/0x400
[ 548.623407] ? __alloc_pages_slowpath+0x1420/0x1420
[ 548.623412] ? __mutex_lock_slowpath+0x20/0x20
[ 548.623417] ? kvmalloc_node+0x31/0x80
[ 548.623424] alloc_pages_current+0x75/0x110
[ 548.623436] kmalloc_order+0x24/0x60
[ 548.623442] kmalloc_order_trace+0x24/0xb0
[ 548.623448] __kmalloc_track_caller+0x207/0x220
[ 548.623455] ? f2fs_build_node_manager+0x399/0xbb0
[ 548.623460] kmemdup+0x20/0x50
[ 548.623465] f2fs_build_node_manager+0x399/0xbb0
[ 548.623470] f2fs_fill_super+0x195e/0x2b40
[ 548.623477] ? f2fs_commit_super+0x1b0/0x1b0
[ 548.623481] ? set_blocksize+0x90/0x140
[ 548.623486] mount_bdev+0x1c5/0x210
[ 548.623489] ? f2fs_commit_super+0x1b0/0x1b0
[ 548.623495] f2fs_mount+0x15/0x20
[ 548.623498] mount_fs+0x60/0x1a0
[ 548.623503] ? alloc_vfsmnt+0x309/0x360
[ 548.623508] vfs_kern_mount+0x6b/0x1a0
[ 548.623513] do_mount+0x34a/0x18c0
[ 548.623518] ? lockref_put_or_lock+0xcf/0x160
[ 548.623523] ? copy_mount_string+0x20/0x20
[ 548.623528] ? memcg_kmem_put_cache+0x1b/0xa0
[ 548.623533] ? kasan_check_write+0x14/0x20
[ 548.623537] ? _copy_from_user+0x6a/0x90
[ 548.623542] ? memdup_user+0x42/0x60
[ 548.623547] ksys_mount+0x83/0xd0
[ 548.623552] __x64_sys_mount+0x67/0x80
[ 548.623557] do_syscall_64+0x78/0x170
[ 548.623562] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 548.623566] RIP: 0033:0x7f76fc331b9a
[ 548.623567] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 548.623632] RSP: 002b:00007ffd4f0a0e48 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 548.623636] RAX: ffffffffffffffda RBX: 000000000146c030 RCX: 00007f76fc331b9a
[ 548.623639] RDX: 000000000146c210 RSI: 000000000146df30 RDI: 0000000001474ec0
[ 548.623641] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 548.623643] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000000001474ec0
[ 548.623646] R13: 000000000146c210 R14: 0000000000000000 R15: 0000000000000003
[ 548.623650] ---[ end trace 4ce02f25ff7d3df5 ]---
[ 548.623656] F2FS-fs (loop0): Failed to initialize F2FS node manager
[ 548.627936] F2FS-fs (loop0): Invalid log blocks per segment (8201)
[ 548.627940] F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
[ 548.635835] F2FS-fs (loop0): Failed to initialize F2FS node manager
- Location
https://elixir.bootlin.com/linux/v4.18-rc1/source/fs/f2fs/segment.c#L3578
sit_i->sit_bitmap = kmemdup(src_bitmap, bitmap_size, GFP_KERNEL);
Buffer overrun happens when doing memcpy. I suspect there are missing (or inconsistent) checks on bitmap_size.
Reported by Wen Xu (wen.xu@gatech.edu) from SSLab, Gatech.
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
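A sanity check of this kind can be sketched as follows. The sketch assumes that each version bitmap holds one bit per block of one copy of its area, and that the area stores two copies (hence the division by two); the function names are hypothetical and the exact expression used by the kernel may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Expected size in bytes of a version bitmap, assuming one bit per block
 * of one copy of the area (the area holds two copies, hence segs / 2).
 */
static uint32_t sketch_expected_bitmap_bytes(uint32_t segs,
					     uint32_t log_blocks_per_seg)
{
	uint64_t blocks;

	assert(log_blocks_per_seg < 32);
	blocks = (uint64_t)(segs / 2) << log_blocks_per_seg;
	return (uint32_t)(blocks / 8);
}

/* Reject a checkpoint whose claimed bitmap size disagrees with geometry. */
static bool sketch_bitmap_size_ok(uint32_t claimed_bytes, uint32_t segs,
				  uint32_t log_blocks_per_seg)
{
	return claimed_bytes ==
	       sketch_expected_bitmap_bytes(segs, log_blocks_per_seg);
}
```

With the size validated before kmemdup(), a crafted bitmap_size can no longer make the copy read past the checkpoint buffer.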
2018-06-23 11:25:19 +08:00
f2fs: fix to check layout on last valid checkpoint park
As Ju Hyung reported:
"
I was semi-forced today to use the new kernel and test f2fs.
My Ubuntu initramfs got a bit wonky and I had to boot into live CD and
fix some stuffs. The live CD was using 4.15 kernel, and just mounting
the f2fs partition there corrupted f2fs and my 4.19(with 5.1-rc1-4.19
f2fs-stable merged) refused to mount with "SIT is corrupted node"
message.
I used the latest f2fs-tools sent by Chao including "fsck.f2fs: fix to
repair cp_loads blocks at correct position"
It spit out 140M worth of output, but at least I didn't have to run it
twice. Everything returned "Ok" in the 2nd run.
The new log is at
http://arter97.com/f2fs/final
After fixing the image, I used my 4.19 kernel with 5.2-rc1-4.19
f2fs-stable merged and it mounted.
But, I got this:
[ 1.047791] F2FS-fs (nvme0n1p3): layout of large_nat_bitmap is
deprecated, run fsck to repair, chksum_offset: 4092
[ 1.081307] F2FS-fs (nvme0n1p3): Found nat_bits in checkpoint
[ 1.161520] F2FS-fs (nvme0n1p3): recover fsync data on readonly fs
[ 1.162418] F2FS-fs (nvme0n1p3): Mounted with checkpoint version = 761c7e00
But after doing a reboot, the message is gone:
[ 1.098423] F2FS-fs (nvme0n1p3): Found nat_bits in checkpoint
[ 1.177771] F2FS-fs (nvme0n1p3): recover fsync data on readonly fs
[ 1.178365] F2FS-fs (nvme0n1p3): Mounted with checkpoint version = 761c7eda
I'm not exactly sure why the kernel detected that I'm still using the
old layout on the first boot. Maybe fsck didn't fix it properly, or
the check from the kernel is improper.
"
Although we have rebuilt the old deprecated checkpoint with the new layout
during repair, we only repair the last checkpoint pack; the other, older
one remains.
Once the image is mounted, we 1) sanity check the layout and 2) decide
which checkpoint pack to use according to cp_ver. As a result, the
reported message is printed unnecessarily at step 1). To avoid this, we
simply move the layout check into f2fs_sanity_check_ckpt(), after step 2).
Reported-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
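The ordering problem can be sketched in user-space C. The struct and helper names below are hypothetical, the version comparison is a plain `>` (it ignores any wrap-around handling the kernel may perform), and the value passed as `min_chksum_offset` in a caller is an assumption. The point is only that the layout check should run on the pack that was actually selected, not on every pack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one checkpoint pack. */
struct sketch_cp_pack {
	bool valid;
	uint64_t version;
	bool large_nat_bitmap;
	uint32_t checksum_offset;
};

/* Step 2): choose the valid pack with the higher version, or NULL. */
static const struct sketch_cp_pack *
sketch_pick_pack(const struct sketch_cp_pack *a, const struct sketch_cp_pack *b)
{
	assert(a != NULL && b != NULL);

	if (a->valid && b->valid)
		return (b->version > a->version) ? b : a;
	if (a->valid)
		return a;
	if (b->valid)
		return b;
	return NULL;
}

/* Layout check, meant to run only on the pack returned above. */
static bool sketch_layout_deprecated(const struct sketch_cp_pack *cp,
				     uint32_t min_chksum_offset)
{
	return cp->large_nat_bitmap &&
	       cp->checksum_offset != min_chksum_offset;
}
```

If the stale pack still carries the old checksum offset but the newer pack was repaired, checking only the selected pack avoids the spurious warning.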
2019-05-20 10:09:22 +08:00
if ( __is_set_ckpt_flags ( ckpt , CP_LARGE_NAT_BITMAP_FLAG ) & &
le32_to_cpu ( ckpt - > checksum_offset ) ! = CP_MIN_CHKSUM_OFFSET ) {
2019-07-11 09:29:15 +08:00
f2fs_warn ( sbi , " using deprecated layout of large_nat_bitmap, "
" please run fsck v1.13.0 or higher to repair, chksum_offset: %u, "
" fixed with patch: \" f2fs-tools: relocate chksum_offset for large_nat_bitmap feature \" " ,
2019-06-18 17:48:42 +08:00
le32_to_cpu ( ckpt - > checksum_offset ) ) ;
2019-05-20 10:09:22 +08:00
return 1 ;
}
2021-08-06 08:04:37 +08:00
nat_blocks = nat_segs < < log_blocks_per_seg ;
nat_bits_bytes = nat_blocks / BITS_PER_BYTE ;
nat_bits_blocks = F2FS_BLK_ALIGN ( ( nat_bits_bytes < < 1 ) + 8 ) ;
if ( __is_set_ckpt_flags ( ckpt , CP_NAT_BITS_FLAG ) & &
( cp_payload + F2FS_CP_PACKS +
NR_CURSEG_PERSIST_TYPE + nat_bits_blocks > = blocks_per_seg ) ) {
f2fs_warn ( sbi , " Insane cp_payload: %u, nat_bits_blocks: %u " ,
cp_payload , nat_bits_blocks ) ;
2021-10-28 20:45:08 +08:00
return 1 ;
2021-08-06 08:04:37 +08:00
}
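The arithmetic in the check above can be worked through with a small standalone sketch. It assumes a 4096-byte block and reads F2FS_BLK_ALIGN() as "round a byte count up to whole blocks"; the `sketch_` names are hypothetical, and the constants that the kernel takes from F2FS_CP_PACKS and NR_CURSEG_PERSIST_TYPE are passed in as parameters rather than assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_BLKSIZE 4096u	/* assumed block size in bytes */

/* Round a byte count up to a number of whole blocks. */
static uint32_t sketch_blk_align(uint32_t bytes)
{
	return (bytes + SKETCH_BLKSIZE - 1) / SKETCH_BLKSIZE;
}

/* Blocks needed for two NAT bitmaps (one bit per NAT block) plus 8 bytes. */
static uint32_t sketch_nat_bits_blocks(uint32_t nat_segs,
				       uint32_t log_blocks_per_seg)
{
	uint32_t nat_blocks, nat_bits_bytes;

	assert(log_blocks_per_seg < 32);
	nat_blocks = nat_segs << log_blocks_per_seg;
	nat_bits_bytes = nat_blocks / 8;
	return sketch_blk_align((nat_bits_bytes << 1) + 8);
}

/* True when the checkpoint pack would not fit inside one segment. */
static bool sketch_cp_layout_insane(uint32_t cp_payload, uint32_t cp_packs,
				    uint32_t nr_curseg,
				    uint32_t nat_bits_blocks,
				    uint32_t blocks_per_seg)
{
	return cp_payload + cp_packs + nr_curseg + nat_bits_blocks >=
	       blocks_per_seg;
}
```

For example, 64 NAT segments of 512 blocks give 32768 NAT blocks, 4096 bitmap bytes, and (2 * 4096 + 8) = 8200 bytes, which rounds up to 3 blocks.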
2014-08-11 16:49:25 -07:00
if ( unlikely ( f2fs_cp_error ( sbi ) ) ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " A bug case: need to run fsck " ) ;
f2fs: prevent checkpoint once any IO failure is detected
This patch enhances the checkpoint routine to cope with IO errors.
Basically f2fs detects IO errors from end_io_write, and the errors can
occur during any of the data, node, and meta page writes.
In the previous code, when an IO error occurs during writes, f2fs sets a
flag, CP_ERROR_FLAG, in the raw checkpoint buffer which will be written to disk.
Afterwards, write_checkpoint() checks the flag and remounts f2fs in
read-only (ro) mode.
However, even once f2fs is remounted in ro mode, dirty checkpoint pages can
still be written to disk freely by the flusher or kswapd in the background.
In such a case, after a cold reboot, f2fs would restore the checkpoint data having
CP_ERROR_FLAG, which disables write_checkpoint and remounts f2fs in
ro mode again.
Therefore, let's prevent any checkpoint page (meta) writes once an IO error
occurs, and remount f2fs in ro mode right away at that moment.
Reported-by: Oliver Winker <oliver@oli1170.net>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
2013-01-24 19:56:11 +09:00
return 1 ;
}
2012-11-02 17:07:47 +09:00
return 0 ;
}
static void init_sb_info ( struct f2fs_sb_info * sbi )
{
struct f2fs_super_block * raw_super = sbi - > raw_super ;
2018-10-24 16:09:42 +08:00
int i ;
2012-11-02 17:07:47 +09:00
sbi - > log_sectors_per_block =
le32_to_cpu ( raw_super - > log_sectors_per_block ) ;
sbi - > log_blocksize = le32_to_cpu ( raw_super - > log_blocksize ) ;
sbi - > blocksize = 1 < < sbi - > log_blocksize ;
sbi - > log_blocks_per_seg = le32_to_cpu ( raw_super - > log_blocks_per_seg ) ;
sbi - > blocks_per_seg = 1 < < sbi - > log_blocks_per_seg ;
sbi - > segs_per_sec = le32_to_cpu ( raw_super - > segs_per_sec ) ;
sbi - > secs_per_zone = le32_to_cpu ( raw_super - > secs_per_zone ) ;
sbi - > total_sections = le32_to_cpu ( raw_super - > section_count ) ;
sbi - > total_node_count =
( le32_to_cpu ( raw_super - > segment_count_nat ) / 2 )
* sbi - > blocks_per_seg * NAT_ENTRY_PER_BLOCK ;
2020-12-07 18:59:33 +08:00
F2FS_ROOT_INO ( sbi ) = le32_to_cpu ( raw_super - > root_ino ) ;
F2FS_NODE_INO ( sbi ) = le32_to_cpu ( raw_super - > node_ino ) ;
F2FS_META_INO ( sbi ) = le32_to_cpu ( raw_super - > meta_ino ) ;
2013-03-31 13:26:03 +09:00
sbi - > cur_victim_sec = NULL_SECNO ;
2022-03-17 16:33:15 +08:00
sbi - > gc_mode = GC_NORMAL ;
f2fs: support subsectional garbage collection
A section is the minimal garbage collection unit of f2fs. On a zoned block
device, or an ancient block-mapping flash device, we can align the GC unit
to the lower device's erase unit in order to improve GC efficiency;
normally, it consists of multiple segments.
Once background or foreground GC triggers, it brings a large number
of IOs, which impacts user IO and also occupies cpu/memory resources
intensively.
So, to reduce the impact of GC on large sections, this patch supports
subsectional GC: in one cycle of GC, it only migrates part of the segment{s}
in the victim section. Currently, by default, we use sbi->segs_per_sec as
the migration granularity.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
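One way to picture subsectional GC is as a sliding window over the victim section. The sketch below is an illustration under assumptions, not the kernel's implementation: `sketch_gc_window` and `sketch_gc_cycle` are hypothetical names, and the cycle counter stands in for whatever state the kernel keeps between GC rounds.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open range [start, end) of segment numbers migrated in one cycle. */
struct sketch_gc_window {
	uint32_t start;
	uint32_t end;
};

/*
 * Compute the segments handled in the given GC cycle of a victim section
 * that begins at sec_start_segno and spans segs_per_sec segments.
 */
static struct sketch_gc_window
sketch_gc_cycle(uint32_t sec_start_segno, uint32_t segs_per_sec,
		uint32_t granularity, uint32_t cycle)
{
	struct sketch_gc_window w;
	uint32_t sec_end = sec_start_segno + segs_per_sec;

	assert(granularity > 0 && granularity <= segs_per_sec);

	w.start = sec_start_segno + cycle * granularity;
	if (w.start > sec_end)
		w.start = sec_end;	/* section already fully migrated */
	w.end = w.start + granularity;
	if (w.end > sec_end)
		w.end = sec_end;	/* last window may be shorter */
	return w;
}
```

With the granularity equal to segs_per_sec, the first cycle covers the whole section, which matches the default described above.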
2018-10-24 18:37:27 +08:00
sbi - > next_victim_seg [ BG_GC ] = NULL_SEGNO ;
sbi - > next_victim_seg [ FG_GC ] = NULL_SEGNO ;
2014-01-08 13:45:08 +09:00
sbi - > max_victim_search = DEF_MAX_VICTIM_SEARCH ;
2018-10-24 18:37:27 +08:00
sbi - > migration_granularity = sbi - > segs_per_sec ;
2021-08-02 21:22:45 -07:00
sbi - > seq_file_ra_mul = MIN_RA_MUL ;
2021-09-29 11:12:03 -07:00
sbi - > max_fragment_chunk = DEF_FRAGMENT_SIZE ;
sbi - > max_fragment_hole = DEF_FRAGMENT_SIZE ;
2022-10-25 14:50:25 +08:00
spin_lock_init ( & sbi - > gc_remaining_trials_lock ) ;
2022-07-18 16:02:48 -07:00
atomic64_set ( & sbi - > current_atomic_write , 0 ) ;
2012-11-02 17:07:47 +09:00
2014-02-27 20:09:05 +09:00
sbi - > dir_level = DEF_DIR_LEVEL ;
2016-01-08 15:51:50 -08:00
sbi - > interval_time [ CP_TIME ] = DEF_CP_INTERVAL ;
2016-01-08 16:57:48 -08:00
sbi - > interval_time [ REQ_TIME ] = DEF_IDLE_INTERVAL ;
2018-09-19 14:18:47 +05:30
sbi - > interval_time [ DISCARD_TIME ] = DEF_IDLE_INTERVAL ;
sbi - > interval_time [ GC_TIME ] = DEF_IDLE_INTERVAL ;
2018-08-20 19:21:43 -07:00
sbi - > interval_time [ DISABLE_TIME ] = DEF_DISABLE_INTERVAL ;
2019-01-14 10:42:11 -08:00
sbi - > interval_time [ UMOUNT_DISCARD_TIMEOUT ] =
DEF_UMOUNT_DISCARD_TIMEOUT ;
2015-01-28 17:48:42 +08:00
clear_sbi_flag ( sbi , SBI_NEED_FSCK ) ;
2015-06-19 12:01:21 -07:00
2016-10-20 19:09:57 -07:00
for ( i = 0 ; i < NR_COUNT_TYPE ; i + + )
atomic_set ( & sbi - > nr_pages [ i ] , 0 ) ;
2018-06-04 23:20:36 +08:00
for ( i = 0 ; i < META ; i + + )
atomic_set ( & sbi - > wb_sync_req [ i ] , 0 ) ;
2017-03-28 18:07:38 -07:00
2015-06-19 12:01:21 -07:00
INIT_LIST_HEAD ( & sbi - > s_list ) ;
mutex_init ( & sbi - > umount_mutex ) ;
2022-01-07 12:48:44 -08:00
init_f2fs_rwsem ( & sbi - > io_order_lock ) ;
2016-09-20 11:04:18 +08:00
spin_lock_init ( & sbi - > cp_lock ) ;
2017-09-29 13:59:39 +08:00
sbi - > dirty_device = 0 ;
spin_lock_init ( & sbi - > dev_lock ) ;
2018-02-11 22:53:20 +08:00
2022-01-07 12:48:44 -08:00
init_f2fs_rwsem ( & sbi - > sb_lock ) ;
init_f2fs_rwsem ( & sbi - > pin_sem ) ;
2012-11-02 17:07:47 +09:00
}
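The geometry that init_sb_info() derives from the log2 fields of the superblock can be reproduced in a standalone sketch. The struct and function names are hypothetical, and the value 455 for the NAT entries per block is an assumption that corresponds to a 4096-byte block holding 9-byte NAT entries.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NAT_ENTRY_PER_BLOCK 455u	/* assumed: 4096 / 9 */

/* Hypothetical subset of the derived in-memory geometry. */
struct sketch_geometry {
	uint32_t blocksize;
	uint32_t blocks_per_seg;
	uint32_t total_node_count;
};

static struct sketch_geometry
sketch_derive_geometry(uint32_t log_blocksize, uint32_t log_blocks_per_seg,
		       uint32_t segment_count_nat)
{
	struct sketch_geometry g;

	assert(log_blocksize < 32 && log_blocks_per_seg < 32);

	g.blocksize = 1u << log_blocksize;
	g.blocks_per_seg = 1u << log_blocks_per_seg;
	/* Only half of the NAT segments count: the area keeps two copies. */
	g.total_node_count = (segment_count_nat / 2) * g.blocks_per_seg *
			     SKETCH_NAT_ENTRY_PER_BLOCK;
	return g;
}
```

For instance, 4 NAT segments with 512 blocks per segment yield 2 * 512 * 455 = 465920 addressable nodes.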
2016-05-13 12:36:58 -07:00
static int init_percpu_info ( struct f2fs_sb_info * sbi )
{
2016-10-20 19:09:57 -07:00
int err ;
2016-05-16 11:06:50 -07:00
2016-05-16 11:42:32 -07:00
err = percpu_counter_init ( & sbi - > alloc_valid_block_count , 0 , GFP_KERNEL ) ;
if ( err )
return err ;
2022-01-27 13:31:43 -08:00
err = percpu_counter_init ( & sbi - > rf_node_block_count , 0 , GFP_KERNEL ) ;
if ( err )
goto err_valid_block ;
2018-09-05 14:54:02 +08:00
err = percpu_counter_init ( & sbi - > total_valid_inode_count , 0 ,
2016-05-16 11:06:50 -07:00
GFP_KERNEL ) ;
2018-09-05 14:54:02 +08:00
if ( err )
2022-01-27 13:31:43 -08:00
goto err_node_block ;
return 0 ;
2018-09-05 14:54:02 +08:00
2022-01-27 13:31:43 -08:00
err_node_block :
percpu_counter_destroy ( & sbi - > rf_node_block_count ) ;
err_valid_block :
percpu_counter_destroy ( & sbi - > alloc_valid_block_count ) ;
2018-09-05 14:54:02 +08:00
return err ;
2016-05-13 12:36:58 -07:00
}
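init_percpu_info() follows the usual kernel pattern of unwinding with goto labels in reverse order of initialization. The sketch below reproduces that control flow in user space. `sketch_counter` and its init/destroy helpers are hypothetical stand-ins for the percpu_counter API, and the `fail_at` parameter exists only to inject a failure for demonstration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a resource such as a percpu counter. */
struct sketch_counter {
	bool live;
};

static int sketch_counter_init(struct sketch_counter *c, bool fail)
{
	if (fail)
		return -ENOMEM;
	c->live = true;
	return 0;
}

static void sketch_counter_destroy(struct sketch_counter *c)
{
	c->live = false;
}

/* Initialize three counters; on failure, destroy the ones already set up. */
static int sketch_init_three(struct sketch_counter c[3], int fail_at)
{
	int err;

	assert(c != NULL);

	err = sketch_counter_init(&c[0], fail_at == 0);
	if (err)
		return err;

	err = sketch_counter_init(&c[1], fail_at == 1);
	if (err)
		goto err_first;

	err = sketch_counter_init(&c[2], fail_at == 2);
	if (err)
		goto err_second;
	return 0;

err_second:
	sketch_counter_destroy(&c[1]);
err_first:
	sketch_counter_destroy(&c[0]);
	return err;
}
```

Because the labels fall through, a failure at the third step releases the second and then the first resource, so no partially initialized state is left behind.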
2016-10-28 17:45:05 +09:00
# ifdef CONFIG_BLK_DEV_ZONED
f2fs: support zone capacity less than zone size
NVMe Zoned Namespace devices can have zone-capacity less than zone-size.
Zone-capacity indicates the maximum number of sectors that are usable in
a zone beginning from the first sector of the zone. This makes the sectors
after the zone-capacity, up to zone-size, unusable.
This patch set tracks zone-size and zone-capacity in zoned devices and
calculates the usable blocks per segment and usable segments per section.
If zone-capacity is less than zone-size, mark only those segments which
start before zone-capacity as free segments. All segments at and beyond
zone-capacity are treated as permanently used segments. In cases where
zone-capacity does not align with segment size, the last segment will start
before zone-capacity and end beyond the zone-capacity of the zone. For
such spanning segments, only sectors within the zone-capacity are used.
During writes and GC, manage the usable segments in a section and usable
blocks per segment. Segments which are beyond zone-capacity are never
allocated and do not need to be garbage collected; only the segments
which are before zone-capacity need to be garbage collected.
For spanning segments, based on the number of usable blocks in that
segment, write to blocks only up to zone-capacity.
Zone-capacity is device specific and cannot be configured by the user.
Since NVMe ZNS device zones are sequentially write only, a block device
with conventional zones or any normal block device is needed along with
the ZNS device for the metadata operations of f2fs.
A typical nvme-cli output of a zoned device shows zone start and capacity
and write pointer as below:
SLBA: 0x0 WP: 0x0 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
SLBA: 0x20000 WP: 0x20000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
SLBA: 0x40000 WP: 0x40000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
Here zone size is 64MB, capacity is 49MB, WP is at zone start as the zones
are in EMPTY state. For each zone, only zone start + 49MB is usable area,
any lba/sector after 49MB cannot be read or written to, the drive will fail
any attempts to read/write. So, the second zone starts at 64MB and is
usable till 113MB (64 + 49) and the range between 113 and 128MB is
again unusable. The next zone starts at 128MB, and so on.
Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
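The usable-segment and usable-block rules above can be checked against the nvme-cli example with a small sketch. It assumes 512-byte LBAs, 4096-byte blocks and 512 blocks per segment (a 2MB segment); all `sketch_` names are hypothetical and this is not the kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Convert 512-byte sectors to 4096-byte blocks (both sizes assumed). */
static uint64_t sketch_sectors_to_blocks(uint64_t sectors)
{
	return sectors >> 3;
}

/* Segments that start before zone-capacity, counting a spanning one. */
static uint32_t sketch_usable_segs(uint64_t cap_blocks,
				   uint32_t blocks_per_seg)
{
	assert(blocks_per_seg > 0);
	return (uint32_t)((cap_blocks + blocks_per_seg - 1) / blocks_per_seg);
}

/* Usable blocks in the seg_idx-th segment of a zone. */
static uint32_t sketch_usable_blocks_in_seg(uint64_t cap_blocks,
					    uint32_t blocks_per_seg,
					    uint32_t seg_idx)
{
	uint64_t start = (uint64_t)seg_idx * blocks_per_seg;

	if (start >= cap_blocks)
		return 0;			/* entirely beyond capacity */
	if (start + blocks_per_seg <= cap_blocks)
		return blocks_per_seg;		/* entirely within capacity */
	return (uint32_t)(cap_blocks - start);	/* spanning segment */
}
```

For Cap 0x18800 (49MB) in a 0x20000-sector (64MB) zone, this gives 24 full segments plus one spanning segment with 256 usable blocks; the remaining 7 of the zone's 32 segments are never allocated.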
2020-07-16 18:26:56 +05:30
struct f2fs_report_zones_args {
2022-06-28 10:57:24 -07:00
struct f2fs_sb_info * sbi ;
2020-07-16 18:26:56 +05:30
struct f2fs_dev_info * dev ;
} ;
2019-11-11 11:39:30 +09:00
static int f2fs_report_zone_cb ( struct blk_zone * zone , unsigned int idx ,
f2fs: support zone capacity less than zone size
NVMe Zoned Namespace devices can have zone-capacity less than zone-size.
Zone-capacity indicates the maximum number of sectors that are usable in
a zone beginning from the first sector of the zone. This makes the sectors
sectors after the zone-capacity till zone-size to be unusable.
This patch set tracks zone-size and zone-capacity in zoned devices and
calculate the usable blocks per segment and usable segments per section.
If zone-capacity is less than zone-size mark only those segments which
start before zone-capacity as free segments. All segments at and beyond
zone-capacity are treated as permanently used segments. In cases where
zone-capacity does not align with segment size the last segment will start
before zone-capacity and end beyond the zone-capacity of the zone. For
such spanning segments only sectors within the zone-capacity are used.
During writes and GC manage the usable segments in a section and usable
blocks per segment. Segments which are beyond zone-capacity are never
allocated, and do not need to be garbage collected, only the segments
which are before zone-capacity needs to garbage collected.
For spanning segments based on the number of usable blocks in that
segment, write to blocks only up to zone-capacity.
Zone-capacity is device specific and cannot be configured by the user.
Since NVMe ZNS device zones are sequentially write only, a block device
with conventional zones or any normal block device is needed along with
the ZNS device for the metadata operations of F2FS.
A typical nvme-cli output of a zoned device shows zone start and capacity
and write pointer as below:
SLBA: 0x0 WP: 0x0 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
SLBA: 0x20000 WP: 0x20000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
SLBA: 0x40000 WP: 0x40000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
Here zone size is 64MB, capacity is 49MB, and WP is at zone start as the zones
are in EMPTY state. For each zone, only the first 49MB from the zone start is
usable; any LBA/sector beyond 49MB cannot be read or written, and the drive
will fail any attempt to do so. So, the second zone starts at 64MB and is
usable till 113MB (64 + 49), and the range between 113MB and 128MB is
again unusable. The next zone starts at 128MB, and so on.
Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-07-16 18:26:56 +05:30
void * data )
2019-11-11 11:39:30 +09:00
{
2020-07-16 18:26:56 +05:30
struct f2fs_report_zones_args * rz_args = data ;
2022-06-28 10:57:24 -07:00
block_t unusable_blocks = ( zone - > len - zone - > capacity ) > >
F2FS_LOG_SECTORS_PER_BLOCK ;
2020-07-16 18:26:56 +05:30
if ( zone - > type = = BLK_ZONE_TYPE_CONVENTIONAL )
return 0 ;
set_bit ( idx , rz_args - > dev - > blkz_seq ) ;
2022-06-28 10:57:24 -07:00
if ( ! rz_args - > sbi - > unusable_blocks_per_sec ) {
rz_args - > sbi - > unusable_blocks_per_sec = unusable_blocks ;
return 0 ;
}
if ( rz_args - > sbi - > unusable_blocks_per_sec ! = unusable_blocks ) {
f2fs_err ( rz_args - > sbi , " F2FS supports single zone capacity \n " ) ;
return - EINVAL ;
}
2019-11-11 11:39:30 +09:00
return 0 ;
}
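The arithmetic behind `unusable_blocks` in the callback above can be checked against the zone geometry quoted in the commit message (zone size 0x20000 sectors, capacity 0x18800 sectors). What follows is a minimal user-space sketch, not kernel code. It assumes 512-byte sectors, 4 KiB blocks (hence a shift of 3, mirroring `F2FS_LOG_SECTORS_PER_BLOCK`) and 2 MiB segments of 512 blocks. The `zone_geom` type and the helper names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 512-byte sectors and 4 KiB blocks, so 8 sectors per block. */
#define LOG_SECTORS_PER_BLOCK 3
/* Assumption: one segment is 512 blocks (2 MiB). */
#define BLOCKS_PER_SEG 512u

/* Hypothetical stand-in for the two zone fields the callback reads. */
struct zone_geom {
	uint64_t len;      /* zone size in sectors */
	uint64_t capacity; /* usable sectors counted from the zone start */
};

/* Blocks between zone capacity and zone size, as the callback computes them. */
static uint64_t unusable_blocks(const struct zone_geom *z)
{
	assert(z->capacity <= z->len);
	return (z->len - z->capacity) >> LOG_SECTORS_PER_BLOCK;
}

/* Number of segments that start before the zone capacity. */
static uint64_t usable_segs(const struct zone_geom *z)
{
	uint64_t cap_blocks = z->capacity >> LOG_SECTORS_PER_BLOCK;

	return (cap_blocks + BLOCKS_PER_SEG - 1) / BLOCKS_PER_SEG;
}

/* Usable blocks in the last usable segment (a spanning segment is partial). */
static uint64_t last_seg_usable_blocks(const struct zone_geom *z)
{
	uint64_t cap_blocks = z->capacity >> LOG_SECTORS_PER_BLOCK;
	uint64_t rem = cap_blocks % BLOCKS_PER_SEG;

	if (cap_blocks == 0)
		return 0;
	return rem ? rem : BLOCKS_PER_SEG;
}
```

Under these assumptions, a zone with `len = 0x20000` and `capacity = 0x18800` has 3840 unusable blocks (15 MiB), 25 segments that start before the capacity, and 256 usable blocks in the last, spanning segment.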
2016-10-06 19:02:05 -07:00
static int init_blkz_info ( struct f2fs_sb_info * sbi , int devi )
2016-10-28 17:45:05 +09:00
{
2016-10-06 19:02:05 -07:00
struct block_device * bdev = FDEV ( devi ) . bdev ;
2020-11-26 18:43:37 +01:00
sector_t nr_sectors = bdev_nr_sectors ( bdev ) ;
2020-07-16 18:26:56 +05:30
struct f2fs_report_zones_args rep_zone_arg ;
2022-04-27 18:02:53 +02:00
u64 zone_sectors ;
2019-11-11 11:39:30 +09:00
int ret ;
2016-10-28 17:45:05 +09:00
2018-10-24 18:34:26 +08:00
if ( ! f2fs_sb_has_blkzoned ( sbi ) )
2016-10-28 17:45:05 +09:00
return 0 ;
2022-04-27 18:02:53 +02:00
zone_sectors = bdev_zone_sectors ( bdev ) ;
2022-04-27 18:02:54 +02:00
if ( ! is_power_of_2 ( zone_sectors ) ) {
f2fs_err ( sbi , " F2FS does not support non power of 2 zone sizes \n " ) ;
return - EINVAL ;
}
2022-04-27 18:02:53 +02:00
2016-10-06 19:02:05 -07:00
if ( sbi - > blocks_per_blkz & & sbi - > blocks_per_blkz ! =
2022-04-27 18:02:53 +02:00
SECTOR_TO_BLOCK ( zone_sectors ) )
2016-10-06 19:02:05 -07:00
return - EINVAL ;
2022-04-27 18:02:53 +02:00
sbi - > blocks_per_blkz = SECTOR_TO_BLOCK ( zone_sectors ) ;
2016-10-06 19:02:05 -07:00
if ( sbi - > log_blocks_per_blkz & & sbi - > log_blocks_per_blkz ! =
__ilog2_u32 ( sbi - > blocks_per_blkz ) )
return - EINVAL ;
2016-10-28 17:45:05 +09:00
sbi - > log_blocks_per_blkz = __ilog2_u32 ( sbi - > blocks_per_blkz ) ;
2016-10-06 19:02:05 -07:00
FDEV ( devi ) . nr_blkz = SECTOR_TO_BLOCK ( nr_sectors ) > >
sbi - > log_blocks_per_blkz ;
2022-04-27 18:02:53 +02:00
if ( nr_sectors & ( zone_sectors - 1 ) )
2016-10-06 19:02:05 -07:00
FDEV ( devi ) . nr_blkz + + ;
2016-10-28 17:45:05 +09:00
2020-06-04 21:57:48 -07:00
FDEV ( devi ) . blkz_seq = f2fs_kvzalloc ( sbi ,
2019-03-16 09:13:07 +09:00
BITS_TO_LONGS ( FDEV ( devi ) . nr_blkz )
* sizeof ( unsigned long ) ,
GFP_KERNEL ) ;
if ( ! FDEV ( devi ) . blkz_seq )
2016-10-28 17:45:05 +09:00
return - ENOMEM ;
2022-06-28 10:57:24 -07:00
rep_zone_arg . sbi = sbi ;
2020-07-16 18:26:56 +05:30
rep_zone_arg . dev = & FDEV ( devi ) ;
2019-11-11 11:39:30 +09:00
ret = blkdev_report_zones ( bdev , 0 , BLK_ALL_ZONES , f2fs_report_zone_cb ,
2020-07-16 18:26:56 +05:30
& rep_zone_arg ) ;
2019-11-11 11:39:30 +09:00
if ( ret < 0 )
return ret ;
return 0 ;
2016-10-28 17:45:05 +09:00
}
# endif
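The zone-count computation in `init_blkz_info()` relies on the zone size being a power of two, so that a shift and a mask can replace division and modulo. A minimal user-space sketch of that step follows. It assumes 512-byte sectors and 4 KiB blocks; `count_zones` and its helpers are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB blocks on 512-byte sectors, so 8 sectors per block. */
#define LOG_SECTORS_PER_BLOCK 3

static bool is_pow2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* floor(log2(n)) for n > 0. */
static unsigned int ilog2_u64(uint64_t n)
{
	unsigned int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/*
 * Number of zones on a device of nr_sectors sectors, counting a trailing
 * partial zone as one more zone. Returns 0 when the zone size is not a
 * power of two or is smaller than one block (the sketch's error value).
 */
static uint64_t count_zones(uint64_t nr_sectors, uint64_t zone_sectors)
{
	uint64_t blocks_per_zone, nr_zones;

	if (!is_pow2(zone_sectors))
		return 0;
	blocks_per_zone = zone_sectors >> LOG_SECTORS_PER_BLOCK;
	if (blocks_per_zone == 0)
		return 0;

	nr_zones = (nr_sectors >> LOG_SECTORS_PER_BLOCK) >>
		   ilog2_u64(blocks_per_zone);
	/* The mask is only valid because zone_sectors is a power of two. */
	if (nr_sectors & (zone_sectors - 1))
		nr_zones++;
	return nr_zones;
}
```

For example, a device of ten full 0x20000-sector zones plus eight extra sectors yields 11 zones, because the partial tail is rounded up.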
2013-10-14 18:47:11 +08:00
/*
* Read f2fs raw super block .
2016-02-17 08:59:01 +08:00
* Because there are two copies of the super block , read both of them
* to get the first valid one . If either of them is broken , pass the
* recovery flag back to the caller .
2013-10-14 18:47:11 +08:00
*/
2016-03-23 17:05:27 -07:00
static int read_raw_super_block ( struct f2fs_sb_info * sbi ,
2013-10-14 18:47:11 +08:00
struct f2fs_super_block * * raw_super ,
2015-12-15 17:19:26 +08:00
int * valid_super_block , int * recovery )
2013-02-01 19:07:03 +08:00
{
2016-03-23 17:05:27 -07:00
struct super_block * sb = sbi - > sb ;
2016-02-17 08:59:01 +08:00
int block ;
2015-12-15 17:19:26 +08:00
struct buffer_head * bh ;
2016-03-20 15:33:20 -07:00
struct f2fs_super_block * super ;
2015-05-21 14:42:53 +08:00
int err = 0 ;
2013-02-01 19:07:03 +08:00
2015-12-15 17:17:20 +08:00
super = kzalloc ( sizeof ( struct f2fs_super_block ) , GFP_KERNEL ) ;
if ( ! super )
return - ENOMEM ;
2016-02-17 08:59:01 +08:00
for ( block = 0 ; block < 2 ; block + + ) {
bh = sb_bread ( sb , block ) ;
if ( ! bh ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " Unable to read %dth superblock " ,
block + 1 ) ;
2016-02-17 08:59:01 +08:00
err = - EIO ;
2019-09-27 09:35:48 +08:00
* recovery = 1 ;
2016-02-17 08:59:01 +08:00
continue ;
}
2013-02-01 19:07:03 +08:00
2016-02-17 08:59:01 +08:00
/* sanity checking of raw super */
2019-07-25 11:08:52 +08:00
err = sanity_check_raw_super ( sbi , bh ) ;
if ( err ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " Can't find valid F2FS filesystem in %dth superblock " ,
block + 1 ) ;
2016-02-17 08:59:01 +08:00
brelse ( bh ) ;
2019-09-27 09:35:48 +08:00
* recovery = 1 ;
2016-02-17 08:59:01 +08:00
continue ;
}
2013-02-01 19:07:03 +08:00
2016-02-17 08:59:01 +08:00
if ( ! * raw_super ) {
2016-03-20 15:33:20 -07:00
memcpy ( super , bh - > b_data + F2FS_SUPER_OFFSET ,
sizeof ( * super ) ) ;
2016-02-17 08:59:01 +08:00
* valid_super_block = block ;
* raw_super = super ;
}
brelse ( bh ) ;
2015-05-21 14:42:53 +08:00
}
/* No valid superblock */
2016-02-17 08:59:01 +08:00
if ( ! * raw_super )
2020-06-10 01:14:46 +03:00
kfree ( super ) ;
2016-02-17 08:59:01 +08:00
else
err = 0 ;
2015-05-21 14:42:53 +08:00
2016-02-17 08:59:01 +08:00
return err ;
2013-02-01 19:07:03 +08:00
}
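The selection logic of `read_raw_super_block()` (use the first valid copy, flag recovery if either copy is broken) can be modelled without any I/O. The sketch below is a simplified user-space model under that reading of the function. `struct sb_copy` and `pick_super` are hypothetical, and a plain -1 stands in for the kernel error codes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one on-disk superblock copy. */
struct sb_copy {
	bool readable; /* the block could be read */
	bool sane;     /* the sanity check passed */
};

/*
 * Scan both copies. On success return 0 and store the index of the first
 * valid copy in *valid_idx. Set *recovery to 1 if any copy is unreadable
 * or fails the sanity check. Return -1 if no copy is valid.
 */
static int pick_super(const struct sb_copy copies[2], int *valid_idx,
		      int *recovery)
{
	int found = -1;

	for (int i = 0; i < 2; i++) {
		if (!copies[i].readable || !copies[i].sane) {
			*recovery = 1;
			continue;
		}
		if (found < 0)
			found = i; /* keep the first valid copy only */
	}

	if (found < 0)
		return -1;
	*valid_idx = found;
	return 0;
}
```

A caller would initialise `*recovery` to 0 before the call, as the flag is only ever raised here, never cleared.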
2016-03-20 15:33:20 -07:00
int f2fs_commit_super ( struct f2fs_sb_info * sbi , bool recover )
2015-04-20 18:49:51 -07:00
{
2015-12-07 10:16:58 -08:00
struct buffer_head * bh ;
2018-09-28 20:25:56 +08:00
__u32 crc = 0 ;
2015-04-20 18:49:51 -07:00
int err ;
2016-03-23 17:05:27 -07:00
if ( ( recover & & f2fs_readonly ( sbi - > sb ) ) | |
bdev_read_only ( sbi - > sb - > s_bdev ) ) {
set_sbi_flag ( sbi , SBI_NEED_SB_WRITE ) ;
2016-03-23 10:42:01 -07:00
return - EROFS ;
2016-03-23 17:05:27 -07:00
}
2016-03-23 10:42:01 -07:00
2018-09-28 20:25:56 +08:00
/* we should update superblock crc here */
2018-10-24 18:34:26 +08:00
if ( ! recover & & f2fs_sb_has_sb_chksum ( sbi ) ) {
2018-09-28 20:25:56 +08:00
crc = f2fs_crc32 ( sbi , F2FS_RAW_SUPER ( sbi ) ,
offsetof ( struct f2fs_super_block , crc ) ) ;
F2FS_RAW_SUPER ( sbi ) - > crc = cpu_to_le32 ( crc ) ;
}
2016-03-20 15:33:20 -07:00
/* write back-up superblock first */
2018-01-29 19:13:15 +08:00
bh = sb_bread ( sbi - > sb , sbi - > valid_super_block ? 0 : 1 ) ;
2015-12-07 10:16:58 -08:00
if ( ! bh )
return - EIO ;
2016-03-20 15:33:20 -07:00
err = __f2fs_commit_super ( bh , F2FS_RAW_SUPER ( sbi ) ) ;
2015-12-07 10:16:58 -08:00
brelse ( bh ) ;
2015-06-08 13:28:03 +08:00
/* if we are in recovery path, skip writing valid superblock */
if ( recover | | err )
2015-12-07 10:16:58 -08:00
return err ;
2015-04-20 18:49:51 -07:00
/* write current valid superblock */
2018-01-29 19:13:15 +08:00
bh = sb_bread ( sbi - > sb , sbi - > valid_super_block ) ;
2016-03-20 15:33:20 -07:00
if ( ! bh )
return - EIO ;
err = __f2fs_commit_super ( bh , F2FS_RAW_SUPER ( sbi ) ) ;
brelse ( bh ) ;
return err ;
2015-04-20 18:49:51 -07:00
}
2022-09-28 23:38:53 +08:00
void f2fs_handle_stop ( struct f2fs_sb_info * sbi , unsigned char reason )
{
struct f2fs_super_block * raw_super = F2FS_RAW_SUPER ( sbi ) ;
int err ;
f2fs_down_write ( & sbi - > sb_lock ) ;
if ( raw_super - > s_stop_reason [ reason ] < ( ( 1 < < BITS_PER_BYTE ) - 1 ) )
raw_super - > s_stop_reason [ reason ] + + ;
err = f2fs_commit_super ( sbi , false ) ;
if ( err )
f2fs_err ( sbi , " f2fs_commit_super fails to record reason:%u err:%d " ,
reason , err ) ;
2022-09-28 23:38:54 +08:00
f2fs_up_write ( & sbi - > sb_lock ) ;
}
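The counter update in `f2fs_handle_stop()` saturates rather than wraps: once a per-reason count reaches the largest value a byte can hold, it stays there. A minimal sketch of that pattern follows, assuming one-byte counters. `bump_stop_reason` is a hypothetical name, and locking and the superblock write are left out.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_BYTE 8

/* Increment counters[reason] unless it already holds the byte maximum (255). */
static void bump_stop_reason(uint8_t *counters, unsigned int reason)
{
	assert(counters != 0);
	if (counters[reason] < ((1 << BITS_PER_BYTE) - 1))
		counters[reason]++;
}
```

Calling this 300 times for the same reason leaves the counter at 255 instead of wrapping to 44.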
static void f2fs_save_errors ( struct f2fs_sb_info * sbi , unsigned char flag )
{
spin_lock ( & sbi - > error_lock ) ;
if ( ! test_bit ( flag , ( unsigned long * ) sbi - > errors ) ) {
set_bit ( flag , ( unsigned long * ) sbi - > errors ) ;
sbi - > error_dirty = true ;
}
spin_unlock ( & sbi - > error_lock ) ;
}
static bool f2fs_update_errors ( struct f2fs_sb_info * sbi )
{
bool need_update = false ;
spin_lock ( & sbi - > error_lock ) ;
if ( sbi - > error_dirty ) {
memcpy ( F2FS_RAW_SUPER ( sbi ) - > s_errors , sbi - > errors ,
MAX_F2FS_ERRORS ) ;
sbi - > error_dirty = false ;
need_update = true ;
}
spin_unlock ( & sbi - > error_lock ) ;
return need_update ;
}
2022-09-28 23:38:53 +08:00
2022-09-28 23:38:54 +08:00
void f2fs_handle_error ( struct f2fs_sb_info * sbi , unsigned char error )
{
int err ;
f2fs_save_errors ( sbi , error ) ;
f2fs_down_write ( & sbi - > sb_lock ) ;
if ( ! f2fs_update_errors ( sbi ) )
goto out_unlock ;
err = f2fs_commit_super ( sbi , false ) ;
if ( err )
f2fs_err ( sbi , " f2fs_commit_super fails to record errors:%u, err:%d " ,
error , err ) ;
out_unlock :
2022-09-28 23:38:53 +08:00
f2fs_up_write ( & sbi - > sb_lock ) ;
}
2016-10-06 19:02:05 -07:00
static int f2fs_scan_devices ( struct f2fs_sb_info * sbi )
{
struct f2fs_super_block * raw_super = F2FS_RAW_SUPER ( sbi ) ;
2017-02-27 20:52:49 +09:00
unsigned int max_devices = MAX_DEVICES ;
2021-09-01 14:39:20 +08:00
unsigned int logical_blksize ;
2016-10-06 19:02:05 -07:00
int i ;
2017-02-27 20:52:49 +09:00
/* Initialize single device information */
if ( ! RDEV ( 0 ) . path [ 0 ] ) {
if ( ! bdev_is_zoned ( sbi - > sb - > s_bdev ) )
2016-10-06 19:02:05 -07:00
return 0 ;
2017-02-27 20:52:49 +09:00
max_devices = 1 ;
}
2016-10-06 19:02:05 -07:00
2017-02-27 20:52:49 +09:00
/*
* Initialize multiple devices information , or single
* zoned block device information .
*/
treewide: Use array_size() in f2fs_kzalloc()
The f2fs_kzalloc() function has no 2-factor argument form, so
multiplication factors need to be wrapped in array_size(). This patch
replaces cases of:
f2fs_kzalloc(handle, a * b, gfp)
with:
f2fs_kzalloc(handle, array_size(a, b), gfp)
as well as handling cases of:
f2fs_kzalloc(handle, a * b * c, gfp)
with:
f2fs_kzalloc(handle, array3_size(a, b, c), gfp)
This does, however, attempt to ignore constant size factors like:
f2fs_kzalloc(handle, 4 * 1024, gfp)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
expression HANDLE;
type TYPE;
expression THING, E;
@@
(
f2fs_kzalloc(HANDLE,
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
f2fs_kzalloc(HANDLE,
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression HANDLE;
expression COUNT;
typedef u8;
typedef __u8;
@@
(
f2fs_kzalloc(HANDLE,
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(char) * COUNT
+ COUNT
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
expression HANDLE;
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
f2fs_kzalloc(HANDLE,
- sizeof(TYPE) * (COUNT_ID)
+ array_size(COUNT_ID, sizeof(TYPE))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(TYPE) * COUNT_ID
+ array_size(COUNT_ID, sizeof(TYPE))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(TYPE) * (COUNT_CONST)
+ array_size(COUNT_CONST, sizeof(TYPE))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(TYPE) * COUNT_CONST
+ array_size(COUNT_CONST, sizeof(TYPE))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(THING) * (COUNT_ID)
+ array_size(COUNT_ID, sizeof(THING))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(THING) * COUNT_ID
+ array_size(COUNT_ID, sizeof(THING))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(THING) * (COUNT_CONST)
+ array_size(COUNT_CONST, sizeof(THING))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(THING) * COUNT_CONST
+ array_size(COUNT_CONST, sizeof(THING))
, ...)
)
// 2-factor product, only identifiers.
@@
expression HANDLE;
identifier SIZE, COUNT;
@@
f2fs_kzalloc(HANDLE,
- SIZE * COUNT
+ array_size(COUNT, SIZE)
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression HANDLE;
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
f2fs_kzalloc(HANDLE,
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression HANDLE;
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
f2fs_kzalloc(HANDLE,
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
f2fs_kzalloc(HANDLE,
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
expression HANDLE;
identifier STRIDE, SIZE, COUNT;
@@
(
f2fs_kzalloc(HANDLE,
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
f2fs_kzalloc(HANDLE,
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
f2fs_kzalloc(HANDLE,
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
f2fs_kzalloc(HANDLE,
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
f2fs_kzalloc(HANDLE,
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
f2fs_kzalloc(HANDLE,
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
f2fs_kzalloc(HANDLE,
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
f2fs_kzalloc(HANDLE,
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression HANDLE;
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
f2fs_kzalloc(HANDLE, C1 * C2 * C3, ...)
|
f2fs_kzalloc(HANDLE,
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants.
@@
expression HANDLE;
expression E1, E2;
constant C1, C2;
@@
(
f2fs_kzalloc(HANDLE, C1 * C2, ...)
|
f2fs_kzalloc(HANDLE,
- E1 * E2
+ array_size(E1, E2)
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 14:28:23 -07:00
sbi - > devs = f2fs_kzalloc ( sbi ,
array_size ( max_devices ,
sizeof ( struct f2fs_dev_info ) ) ,
GFP_KERNEL ) ;
2017-02-27 20:52:49 +09:00
if ( ! sbi - > devs )
return - ENOMEM ;
2016-10-06 19:02:05 -07:00
2021-09-01 14:39:20 +08:00
logical_blksize = bdev_logical_block_size ( sbi - > sb - > s_bdev ) ;
sbi - > aligned_blksize = true ;
2017-02-27 20:52:49 +09:00
for ( i = 0 ; i < max_devices ; i + + ) {
2016-10-06 19:02:05 -07:00
2017-02-27 20:52:49 +09:00
if ( i > 0 & & ! RDEV ( i ) . path [ 0 ] )
break ;
if ( max_devices = = 1 ) {
/* Single zoned block device mount */
FDEV ( 0 ) . bdev =
blkdev_get_by_dev ( sbi - > sb - > s_bdev - > bd_dev ,
2016-10-06 19:02:05 -07:00
sbi - > sb - > s_mode , sbi - > sb - > s_type ) ;
2017-02-27 20:52:49 +09:00
} else {
/* Multi-device mount */
memcpy ( FDEV ( i ) . path , RDEV ( i ) . path , MAX_PATH_LEN ) ;
FDEV ( i ) . total_segments =
le32_to_cpu ( RDEV ( i ) . total_segments ) ;
if ( i = = 0 ) {
FDEV ( i ) . start_blk = 0 ;
FDEV ( i ) . end_blk = FDEV ( i ) . start_blk +
( FDEV ( i ) . total_segments < <
sbi - > log_blocks_per_seg ) - 1 +
le32_to_cpu ( raw_super - > segment0_blkaddr ) ;
} else {
FDEV ( i ) . start_blk = FDEV ( i - 1 ) . end_blk + 1 ;
FDEV ( i ) . end_blk = FDEV ( i ) . start_blk +
( FDEV ( i ) . total_segments < <
sbi - > log_blocks_per_seg ) - 1 ;
}
FDEV ( i ) . bdev = blkdev_get_by_path ( FDEV ( i ) . path ,
2016-10-06 19:02:05 -07:00
sbi - > sb - > s_mode , sbi - > sb - > s_type ) ;
2017-02-27 20:52:49 +09:00
}
2016-10-06 19:02:05 -07:00
if ( IS_ERR ( FDEV ( i ) . bdev ) )
return PTR_ERR ( FDEV ( i ) . bdev ) ;
/* to release errored devices */
sbi - > s_ndevs = i + 1 ;
2021-09-01 14:39:20 +08:00
if ( logical_blksize ! = bdev_logical_block_size ( FDEV ( i ) . bdev ) )
sbi - > aligned_blksize = false ;
2016-10-06 19:02:05 -07:00
# ifdef CONFIG_BLK_DEV_ZONED
if ( bdev_zoned_model ( FDEV ( i ) . bdev ) = = BLK_ZONED_HM & &
2018-10-24 18:34:26 +08:00
! f2fs_sb_has_blkzoned ( sbi ) ) {
2021-05-26 13:05:36 -07:00
f2fs_err ( sbi , " Zoned block device feature not enabled " ) ;
2016-10-06 19:02:05 -07:00
return - EINVAL ;
}
if ( bdev_zoned_model ( FDEV ( i ) . bdev ) ! = BLK_ZONED_NONE ) {
if ( init_blkz_info ( sbi , i ) ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " Failed to initialize F2FS blkzone information " ) ;
2016-10-06 19:02:05 -07:00
return - EINVAL ;
}
2017-02-27 20:52:49 +09:00
if ( max_devices = = 1 )
break ;
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Mount Device [%2d]: %20s, %8u, %8x - %8x (zone: %s) " ,
i , FDEV ( i ) . path ,
FDEV ( i ) . total_segments ,
FDEV ( i ) . start_blk , FDEV ( i ) . end_blk ,
bdev_zoned_model ( FDEV ( i ) . bdev ) = = BLK_ZONED_HA ?
" Host-aware " : " Host-managed " ) ;
2016-10-06 19:02:05 -07:00
continue ;
}
# endif
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Mount Device [%2d]: %20s, %8u, %8x - %8x " ,
i , FDEV ( i ) . path ,
FDEV ( i ) . total_segments ,
FDEV ( i ) . start_blk , FDEV ( i ) . end_blk ) ;
}
f2fs_info ( sbi ,
" IO Block Size: %8d KB " , F2FS_IO_SIZE_KB ( sbi ) ) ;
2016-10-06 19:02:05 -07:00
return 0 ;
}
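The block ranges assigned in the multi-device branch above can be sketched in isolation. This is a minimal userspace sketch, not the kernel code: `struct dev_range` and `layout_devices()` are hypothetical names, and it assumes the shifted segment counts fit in 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the start_blk/end_blk pair kept per device. */
struct dev_range {
	uint32_t start_blk;
	uint32_t end_blk;
};

/*
 * Lay devices out back to back in one block address space.
 * Device 0 starts at block 0 and its end is additionally offset by
 * segment0_blkaddr, as in the source; every later device starts right
 * after its predecessor.
 */
static void layout_devices(const uint32_t *total_segments, size_t ndevs,
			   unsigned int log_blocks_per_seg,
			   uint32_t segment0_blkaddr,
			   struct dev_range *out)
{
	assert(total_segments != NULL && out != NULL);

	for (size_t i = 0; i < ndevs; i++) {
		/* Number of blocks covered by this device's segments. */
		uint32_t blocks = total_segments[i] << log_blocks_per_seg;

		if (i == 0) {
			out[i].start_blk = 0;
			out[i].end_blk = out[i].start_blk + blocks - 1 +
					 segment0_blkaddr;
		} else {
			out[i].start_blk = out[i - 1].end_blk + 1;
			out[i].end_blk = out[i].start_blk + blocks - 1;
		}
	}
}
```

With two devices of 2 and 3 segments, 512 blocks per segment and a segment0_blkaddr of 512, the ranges come out contiguous, with no gap or overlap between device 0 and device 1.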
2019-07-23 16:05:28 -07:00
static int f2fs_setup_casefold ( struct f2fs_sb_info * sbi )
{
2022-01-18 07:56:14 +01:00
# if IS_ENABLED(CONFIG_UNICODE)
2020-07-08 02:12:36 -07:00
if ( f2fs_sb_has_casefold ( sbi ) & & ! sbi - > sb - > s_encoding ) {
2019-07-23 16:05:28 -07:00
const struct f2fs_sb_encodings * encoding_info ;
struct unicode_map * encoding ;
__u16 encoding_flags ;
2021-09-15 08:59:57 +02:00
encoding_info = f2fs_sb_read_encoding ( sbi - > raw_super ) ;
if ( ! encoding_info ) {
2019-07-23 16:05:28 -07:00
f2fs_err ( sbi ,
" Encoding requested by superblock is unknown " ) ;
return - EINVAL ;
}
2021-09-15 08:59:57 +02:00
encoding_flags = le16_to_cpu ( sbi - > raw_super - > s_encoding_flags ) ;
2019-07-23 16:05:28 -07:00
encoding = utf8_load ( encoding_info - > version ) ;
if ( IS_ERR ( encoding ) ) {
f2fs_err ( sbi ,
2021-09-15 09:00:00 +02:00
" can't mount with superblock charset: %s-%u.%u.%u "
2019-07-23 16:05:28 -07:00
" not supported by the kernel. flags: 0x%x. " ,
2021-09-15 09:00:00 +02:00
encoding_info - > name ,
unicode_major ( encoding_info - > version ) ,
unicode_minor ( encoding_info - > version ) ,
unicode_rev ( encoding_info - > version ) ,
2019-07-23 16:05:28 -07:00
encoding_flags ) ;
return PTR_ERR ( encoding ) ;
}
f2fs_info ( sbi , " Using encoding defined by superblock: "
2021-09-15 09:00:00 +02:00
" %s-%u.%u.%u with flags 0x%hx " , encoding_info - > name ,
unicode_major ( encoding_info - > version ) ,
unicode_minor ( encoding_info - > version ) ,
unicode_rev ( encoding_info - > version ) ,
encoding_flags ) ;
2019-07-23 16:05:28 -07:00
2020-07-08 02:12:36 -07:00
sbi - > sb - > s_encoding = encoding ;
sbi - > sb - > s_encoding_flags = encoding_flags ;
2019-07-23 16:05:28 -07:00
}
# else
if ( f2fs_sb_has_casefold ( sbi ) ) {
f2fs_err ( sbi , " Filesystem with casefold feature cannot be mounted without CONFIG_UNICODE " ) ;
return - EINVAL ;
}
# endif
return 0 ;
}
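The "%s-%u.%u.%u" messages above print a packed Unicode version as major.minor.revision. A minimal sketch of that unpacking follows, assuming the version is packed as major << 16 | minor << 8 | revision; the helper names here are hypothetical and are not the kernel's unicode_major()/unicode_minor()/unicode_rev().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed packing: bits 16-23 major, bits 8-15 minor, bits 0-7 revision. */
static uint32_t pack_version(uint8_t major, uint8_t minor, uint8_t rev)
{
	return ((uint32_t)major << 16) | ((uint32_t)minor << 8) | rev;
}

static unsigned int version_major(uint32_t v) { return (v >> 16) & 0xff; }
static unsigned int version_minor(uint32_t v) { return (v >> 8) & 0xff; }
static unsigned int version_rev(uint32_t v)   { return v & 0xff; }

/* Format "<name>-<major>.<minor>.<rev>"; returns snprintf()'s result. */
static int format_encoding(char *buf, size_t len, const char *name,
			   uint32_t version)
{
	assert(buf != NULL && name != NULL);
	return snprintf(buf, len, "%s-%u.%u.%u", name,
			version_major(version),
			version_minor(version),
			version_rev(version));
}
```

For example, formatting the name "utf8" with pack_version(12, 1, 0) writes "utf8-12.1.0" into the buffer.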
2018-02-22 14:09:30 -08:00
static void f2fs_tuning_parameters ( struct f2fs_sb_info * sbi )
{
/* adjust parameters according to the volume size */
2022-11-15 14:35:36 +08:00
if ( MAIN_SEGS ( sbi ) < = SMALL_VOLUME_SEGMENTS ) {
f2fs: introduce discard_unit mount option
As James Z reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=213877
[1.] One-line summary of the problem:
Mount multiple SMR block devices exceed certain number cause system non-response
[2.] Full description of the problem/report:
Created some F2FS on SMR devices (mkfs.f2fs -m), then mounted in sequence. Each device is the same Model: HGST HSH721414AL (Size 14TB).
Empirically, found that when the number of SMR devices * 1.5GB > system RAM, the system ran out of memory and hung. No dmesg output. For example, 24 SMR disks need 24 * 1.5GB = 36GB. A system with 32GB of RAM can only mount 21 devices; mounting the 22nd device reproducibly hangs the system.
The number of SMR devices with other FS mounted on this system does not interfere with the result above.
[3.] Keywords (i.e., modules, networking, kernel):
F2FS, SMR, Memory
[4.] Kernel information
[4.1.] Kernel version (uname -a):
Linux 5.13.4-200.fc34.x86_64 #1 SMP Tue Jul 20 20:27:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[4.2.] Kernel .config file:
Default Fedora 34 with f2fs-tools-1.14.0-2.fc34.x86_64
[5.] Most recent kernel version which did not have the bug:
None
[6.] Output of Oops.. message (if applicable) with symbolic information
resolved (see Documentation/admin-guide/oops-tracing.rst)
None
[7.] A small shell script or example program which triggers the
problem (if possible)
mount /dev/sdX /mnt/0X
[8.] Memory consumption
With 24 * 14T SMR Block device with F2FS
free -g
              total        used        free      shared  buff/cache   available
Mem:             46          36           0           0          10          10
Swap:             0           0           0
With 3 * 14T SMR Block device with F2FS
free -g
              total        used        free      shared  buff/cache   available
Mem:              7           5           0           0           1           1
Swap:             7           0           7
The root cause is, there are three bitmaps:
- cur_valid_map
- ckpt_valid_map
- discard_map
and each of them costs ~500MB of memory. {cur, ckpt}_valid_map are
necessary, but discard_map is optional, since this bitmap is only
useful on mount points where small discard is enabled.
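The order of magnitude can be checked with a rough estimate. The sketch below assumes each bitmap holds one bit per 4KB block and takes the per-device footprint of 1.5GB from the report; `bitmap_bytes()` and `max_devices()` are hypothetical helpers, not f2fs functions.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for a bitmap with one bit per block (assumed layout). */
static uint64_t bitmap_bytes(uint64_t device_bytes, uint32_t block_size)
{
	assert(block_size != 0);
	uint64_t blocks = device_bytes / block_size;

	return (blocks + 7) / 8;	/* round up to whole bytes */
}

/* How many devices fit in RAM given a fixed per-device footprint. */
static uint64_t max_devices(uint64_t ram_mb, uint64_t per_device_mb)
{
	assert(per_device_mb != 0);
	return ram_mb / per_device_mb;
}
```

For a 14TB (14 * 10^12 bytes) device with 4096-byte blocks this gives roughly 427MB per bitmap, the same order as the ~500MB quoted above, and a 32GB machine at 1.5GB per device fits 21 devices, matching the report.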
For a blkzoned device such as SMR or ZNS devices, f2fs will only issue
discard for a section(zone) when all blocks of that section are invalid,
so, for such device, we don't need small discard functionality at all.
This patch introduces a new mount option "discard_unit=block|segment|
section" to support issuing discards in different basic units, aligned
to block, segment or section, so that the user can specify
"discard_unit=segment" or "discard_unit=section" to disable small
discard functionality.
Note that this mount option cannot be changed by remount(), because
the related metadata needs to be initialized during mount().
In order to save memory, let's use "discard_unit=section" for blkzoned
device by default.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-03 08:15:43 +08:00
if ( f2fs_block_unit_discard ( sbi ) )
2022-11-24 00:44:01 +08:00
SM_I ( sbi ) - > dcc_info - > discard_granularity =
MIN_DISCARD_GRANULARITY ;
2023-02-06 22:43:08 +08:00
if ( ! f2fs_lfs_mode ( sbi ) )
SM_I ( sbi ) - > ipu_policy = BIT ( F2FS_IPU_FORCE ) |
BIT ( F2FS_IPU_HONOR_OPU_WRITE ) ;
2018-02-22 14:09:30 -08:00
}
2018-06-11 18:02:01 +08:00
2022-11-15 14:35:37 +08:00
sbi - > readdir_ra = true ;
2018-02-22 14:09:30 -08:00
}
2012-11-02 17:07:47 +09:00
static int f2fs_fill_super ( struct super_block * sb , void * data , int silent )
{
struct f2fs_sb_info * sbi ;
2015-05-21 14:42:53 +08:00
struct f2fs_super_block * raw_super ;
2012-11-02 17:07:47 +09:00
struct inode * root ;
2016-05-11 17:08:14 +08:00
int err ;
2019-02-19 16:23:53 +08:00
bool skip_recovery = false , need_fsck = false ;
2015-01-23 17:41:39 -08:00
char * options = NULL ;
2015-12-15 17:19:26 +08:00
int recovery , i , valid_super_block ;
2016-01-27 09:57:30 +08:00
struct curseg_info * seg_i ;
2019-02-19 16:23:53 +08:00
int retry_cnt = 1 ;
2012-11-02 17:07:47 +09:00
2014-08-08 15:37:41 -07:00
try_onemore :
2015-05-21 14:42:53 +08:00
err = - EINVAL ;
raw_super = NULL ;
2015-12-15 17:19:26 +08:00
valid_super_block = - 1 ;
2015-05-21 14:42:53 +08:00
recovery = 0 ;
2012-11-02 17:07:47 +09:00
/* allocate memory for f2fs-specific super block info */
sbi = kzalloc ( sizeof ( struct f2fs_sb_info ) , GFP_KERNEL ) ;
if ( ! sbi )
return - ENOMEM ;
2016-03-23 17:05:27 -07:00
sbi - > sb = sb ;
2022-11-09 07:04:42 +09:00
/* initialize locks within allocated memory */
init_f2fs_rwsem ( & sbi - > gc_lock ) ;
mutex_init ( & sbi - > writepages ) ;
init_f2fs_rwsem ( & sbi - > cp_global_sem ) ;
init_f2fs_rwsem ( & sbi - > node_write ) ;
init_f2fs_rwsem ( & sbi - > node_change ) ;
spin_lock_init ( & sbi - > stat_lock ) ;
init_f2fs_rwsem ( & sbi - > cp_rwsem ) ;
init_f2fs_rwsem ( & sbi - > quota_sem ) ;
init_waitqueue_head ( & sbi - > cp_wait ) ;
spin_lock_init ( & sbi - > error_lock ) ;
for ( i = 0 ; i < NR_INODE_TYPE ; i + + ) {
INIT_LIST_HEAD ( & sbi - > inode_list [ i ] ) ;
spin_lock_init ( & sbi - > inode_lock [ i ] ) ;
}
mutex_init ( & sbi - > flush_lock ) ;
2016-03-02 12:04:24 -08:00
/* Load the checksum driver */
sbi - > s_chksum_driver = crypto_alloc_shash ( " crc32 " , 0 , 0 ) ;
if ( IS_ERR ( sbi - > s_chksum_driver ) ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " Cannot load crc32 driver. " ) ;
2016-03-02 12:04:24 -08:00
err = PTR_ERR ( sbi - > s_chksum_driver ) ;
sbi - > s_chksum_driver = NULL ;
goto free_sbi ;
}
2013-01-12 14:41:13 +09:00
/* set a block size */
2013-12-06 15:00:58 +09:00
if ( unlikely ( ! sb_set_blocksize ( sb , F2FS_BLKSIZE ) ) ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " unable to set blocksize " ) ;
2012-11-02 17:07:47 +09:00
goto free_sbi ;
2012-12-30 14:52:05 +09:00
}
2012-11-02 17:07:47 +09:00
2016-03-23 17:05:27 -07:00
err = read_raw_super_block ( sbi , & raw_super , & valid_super_block ,
2015-12-15 17:19:26 +08:00
& recovery ) ;
2013-10-14 18:47:11 +08:00
if ( err )
goto free_sbi ;
2013-06-07 14:16:53 +08:00
sb - > s_fs_info = sbi ;
2016-06-13 09:47:48 -07:00
sbi - > raw_super = raw_super ;
2022-11-09 07:04:42 +09:00
memcpy ( sbi - > errors , raw_super - > s_errors , MAX_F2FS_ERRORS ) ;
2017-07-31 20:19:09 +08:00
/* precompute checksum seed for metadata */
2018-10-24 18:34:26 +08:00
if ( f2fs_sb_has_inode_chksum ( sbi ) )
2017-07-31 20:19:09 +08:00
sbi - > s_chksum_seed = f2fs_chksum ( sbi , ~ 0 , raw_super - > uuid ,
sizeof ( raw_super - > uuid ) ) ;
2015-05-07 18:11:37 +08:00
default_options ( sbi ) ;
2012-11-02 17:07:47 +09:00
/* parse mount options */
2015-01-23 17:41:39 -08:00
options = kstrdup ( ( const char * ) data , GFP_KERNEL ) ;
if ( data & & ! options ) {
err = - ENOMEM ;
2012-11-02 17:07:47 +09:00
goto free_sb_buf ;
2015-01-23 17:41:39 -08:00
}
fscrypt: support test_dummy_encryption=v2
v1 encryption policies are deprecated in favor of v2, and some new
features (e.g. encryption+casefolding) are only being added for v2.
Therefore, the "test_dummy_encryption" mount option (which is used for
encryption I/O testing with xfstests) needs to support v2 policies.
To do this, extend its syntax to be "test_dummy_encryption=v1" or
"test_dummy_encryption=v2". The existing "test_dummy_encryption" (no
argument) also continues to be accepted, to specify the default setting
-- currently v1, but the next patch changes it to v2.
To cleanly support both v1 and v2 while also making it easy to support
specifying other encryption settings in the future (say, accepting
"$contents_mode:$filenames_mode:v2"), make ext4 and f2fs maintain a
pointer to the dummy fscrypt_context rather than using mount flags.
To avoid concurrency issues, don't allow test_dummy_encryption to be set
or changed during a remount. (The former restriction is new, but
xfstests doesn't run into it, so no one should notice.)
Tested with 'gce-xfstests -c {ext4,f2fs}/encrypt -g auto'. On ext4,
there are two regressions, both of which are test bugs: ext4/023 and
ext4/028 fail because they set an xattr and expect it to be stored
inline, but the increase in size of the fscrypt_context from
24 to 40 bytes causes this xattr to be spilled into an external block.
Link: https://lore.kernel.org/r/20200512233251.118314-4-ebiggers@kernel.org
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-12 16:32:50 -07:00
err = parse_options ( sb , options , false ) ;
2015-01-23 17:41:39 -08:00
if ( err )
goto free_options ;
2012-11-02 17:07:47 +09:00
2021-01-13 13:21:54 +08:00
sb - > s_maxbytes = max_file_blocks ( NULL ) < <
2015-12-31 14:35:37 +08:00
le32_to_cpu ( raw_super - > log_blocksize ) ;
2012-11-02 17:07:47 +09:00
sb - > s_max_links = F2FS_LINK_MAX ;
2019-07-23 16:05:28 -07:00
err = f2fs_setup_casefold ( sbi ) ;
if ( err )
goto free_options ;
2017-07-09 00:13:07 +08:00
# ifdef CONFIG_QUOTA
sb - > dq_op = & f2fs_quota_operations ;
2019-05-20 16:17:56 -07:00
sb - > s_qcop = & f2fs_quotactl_ops ;
2017-07-26 00:01:41 +08:00
sb - > s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP | QTYPE_MASK_PRJ ;
2017-11-16 16:59:14 +08:00
2018-10-24 18:34:26 +08:00
if ( f2fs_sb_has_quota_ino ( sbi ) ) {
2017-11-16 16:59:14 +08:00
for ( i = 0 ; i < MAXQUOTAS ; i + + ) {
if ( f2fs_qf_ino ( sbi - > sb , i ) )
sbi - > nquota_files + + ;
}
}
2017-07-09 00:13:07 +08:00
# endif
2012-11-02 17:07:47 +09:00
sb - > s_op = & f2fs_sops ;
2018-12-12 15:20:12 +05:30
# ifdef CONFIG_FS_ENCRYPTION
2015-05-15 16:26:10 -07:00
sb - > s_cop = & f2fs_cryptops ;
f2fs: add fs-verity support
Add fs-verity support to f2fs. fs-verity is a filesystem feature that
enables transparent integrity protection and authentication of read-only
files. It uses a dm-verity like mechanism at the file level: a Merkle
tree is used to verify any block in the file in log(filesize) time. It
is implemented mainly by helper functions in fs/verity/. See
Documentation/filesystems/fsverity.rst for the full documentation.
The f2fs support for fs-verity consists of:
- Adding a filesystem feature flag and an inode flag for fs-verity.
- Implementing the fsverity_operations to support enabling verity on an
inode and reading/writing the verity metadata.
- Updating ->readpages() to verify data as it's read from verity files
and to support reading verity metadata pages.
- Updating ->write_begin(), ->write_end(), and ->writepages() to support
writing verity metadata pages.
- Calling the fs-verity hooks for ->open(), ->setattr(), and ->ioctl().
Like ext4, f2fs stores the verity metadata (Merkle tree and
fsverity_descriptor) past the end of the file, starting at the first 64K
boundary beyond i_size. This approach works because (a) verity files
are readonly, and (b) pages fully beyond i_size aren't visible to
userspace but can be read/written internally by f2fs with only some
relatively small changes to f2fs. Extended attributes cannot be used
because (a) f2fs limits the total size of an inode's xattr entries to
4096 bytes, which wouldn't be enough for even a single Merkle tree
block, and (b) f2fs encryption doesn't encrypt xattrs, yet the verity
metadata *must* be encrypted when the file is because it contains hashes
of the plaintext data.
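The placement rule for the verity metadata can be sketched as a round-up to the next 64K multiple. This is a sketch under the assumption that a size already on a 64K boundary maps to itself; `VERITY_ALIGN` and `verity_metadata_pos()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* 64K alignment taken from the text above; the macro name is hypothetical. */
#define VERITY_ALIGN 65536ULL

/* Offset at which verity metadata would start for a file of i_size bytes. */
static uint64_t verity_metadata_pos(uint64_t i_size)
{
	assert(i_size <= UINT64_MAX - (VERITY_ALIGN - 1));
	return (i_size + VERITY_ALIGN - 1) & ~(VERITY_ALIGN - 1);
}
```

A 1-byte file would place its metadata at offset 65536, and a file of 65537 bytes at offset 131072, so the metadata never shares a 64K region with file data.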
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-07-22 09:26:24 -07:00
# endif
# ifdef CONFIG_FS_VERITY
sb - > s_vop = & f2fs_verityops ;
2017-10-09 12:15:38 -07:00
# endif
2012-11-02 17:07:47 +09:00
sb - > s_xattr = f2fs_xattr_handlers ;
sb - > s_export_op = & f2fs_export_ops ;
sb - > s_magic = F2FS_SUPER_MAGIC ;
sb - > s_time_gran = 1 ;
2017-11-27 13:05:09 -08:00
sb - > s_flags = ( sb - > s_flags & ~ SB_POSIXACL ) |
( test_opt ( sbi , POSIX_ACL ) ? SB_POSIXACL : 0 ) ;
2017-05-10 15:06:33 +02:00
memcpy ( & sb - > s_uuid , raw_super - > uuid , sizeof ( raw_super - > uuid ) ) ;
2018-01-09 19:33:39 +08:00
sb - > s_iflags | = SB_I_CGROUPWB ;
2012-11-02 17:07:47 +09:00
/* init f2fs-specific super block info */
2015-12-15 17:19:26 +08:00
sbi - > valid_super_block = valid_super_block ;
2015-08-11 12:45:39 -07:00
/* disallow all the data/node/meta page writes */
set_sbi_flag ( sbi , SBI_POR_DOING ) ;
2013-11-18 17:16:17 +09:00
2022-05-25 17:43:36 +08:00
err = f2fs_init_write_merge_io ( sbi ) ;
if ( err )
goto free_bio_info ;
2013-11-18 17:16:17 +09:00
2012-11-02 17:07:47 +09:00
init_sb_info ( sbi ) ;
2021-08-19 20:52:28 -07:00
err = f2fs_init_iostat ( sbi ) ;
2016-05-13 12:36:58 -07:00
if ( err )
2018-01-17 16:31:35 +08:00
goto free_bio_info ;
2016-05-13 12:36:58 -07:00
err = init_percpu_info ( sbi ) ;
if ( err )
f2fs: introduce periodic iostat io latency traces
Whenever we notice sluggishness on our machines, we want to know how
well each type of I/O in the f2fs filesystem is being handled, but it
is hard to get this kind of real data. First, we need to reproduce the
issue with a profiling tool such as blktrace turned on, and the issue
does not reproduce easily. Second, the intervention of any tool
slightly changes the overall timing of the issue, which sometimes
makes it hard to figure out.
So, I added a feature that prints I/O latency statistics as tracepoint
events, the minimum needed to understand the filesystem's I/O-related
behavior, under the F2FS_IOSTAT kernel config. With the "iostat_enable"
sysfs node on, we can get these statistics periodically with minimal
overhead.
[samples]
f2fs_ckpt-254:1-507 [003] .... 2842.439683: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [136/1/801], rd_node [136/1/1704], rd_meta [4/2/4],
wr_sync_data [164/16/3331], wr_sync_node [152/3/648],
wr_sync_meta [160/2/4243], wr_async_data [24/13/15],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
f2fs_ckpt-254:1-507 [002] .... 2845.450514: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [60/3/456], rd_node [60/3/1258], rd_meta [0/0/1],
wr_sync_data [120/12/2285], wr_sync_node [88/5/428],
wr_sync_meta [52/6/2990], wr_async_data [4/1/3],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-20 15:29:09 -07:00
goto free_iostat ;
2016-05-13 12:36:58 -07:00
2019-07-12 16:55:42 +08:00
if ( F2FS_IO_ALIGNED ( sbi ) ) {
2016-12-14 10:12:56 -08:00
sbi - > write_io_dummy =
2017-02-27 18:43:13 +08:00
mempool_create_page_pool ( 2 * ( F2FS_IO_SIZE ( sbi ) - 1 ) , 0 ) ;
2017-06-12 09:44:27 +08:00
if ( ! sbi - > write_io_dummy ) {
err = - ENOMEM ;
2018-01-17 16:31:35 +08:00
goto free_percpu ;
2017-06-12 09:44:27 +08:00
}
2016-12-14 10:12:56 -08:00
}
2020-02-25 18:17:10 +08:00
/* init per sbi slab cache */
err = f2fs_init_xattr_caches ( sbi ) ;
if ( err )
goto free_io_dummy ;
2020-09-14 17:05:13 +08:00
err = f2fs_init_page_array_cache ( sbi ) ;
if ( err )
goto free_xattr_cache ;
2020-02-25 18:17:10 +08:00
2012-11-02 17:07:47 +09:00
/* get an inode for meta space */
sbi - > meta_inode = f2fs_iget ( sb , F2FS_META_INO ( sbi ) ) ;
if ( IS_ERR ( sbi - > meta_inode ) ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " Failed to read F2FS meta data inode " ) ;
2012-11-02 17:07:47 +09:00
err = PTR_ERR ( sbi - > meta_inode ) ;
2020-09-14 17:05:13 +08:00
goto free_page_array_cache ;
2012-11-02 17:07:47 +09:00
}
f2fs: clean up symbol namespace
As Ted reported:
"Hi, I was looking at f2fs's sources recently, and I noticed that there
is a very large number of non-static symbols which don't have a f2fs
prefix. There's well over a hundred (see attached below).
As one example, in fs/f2fs/dir.c there is:
unsigned char get_de_type(struct f2fs_dir_entry *de)
This function is clearly only useful for f2fs, but it has a generic
name. This means that if any other file system tries to have the same
symbol name, there will be a symbol conflict and the kernel would not
successfully build. It also means that when someone is looking at f2fs
sources, it's not at all obvious whether a function such as
read_data_page(), invalidate_blocks(), is a generic kernel function
found in the fs, mm, or block layers, or a f2fs specific function.
You might want to fix this at some point. Hopefully Kent's bcachefs
isn't similarly using generically named functions, since that might
cause conflicts with f2fs's functions --- but just as this would be a
problem that we would rightly insist that Kent fix, this is something
that we should have rightly insisted that f2fs should have fixed
before it was integrated into the mainline kernel.
acquire_orphan_inode
add_ino_entry
add_orphan_inode
allocate_data_block
allocate_new_segments
alloc_nid
alloc_nid_done
alloc_nid_failed
available_free_memory
...."
This patch adds the "f2fs_" prefix to all non-static symbols in order to:
a) avoid conflicts with other generic kernel symbols;
b) indicate that a function is f2fs-specific rather than generic.
Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 00:20:41 +08:00
err = f2fs_get_valid_checkpoint ( sbi ) ;
2012-12-30 14:52:05 +09:00
if ( err ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " Failed to get valid F2FS checkpoint " ) ;
2012-11-02 17:07:47 +09:00
goto free_meta_inode ;
2012-12-30 14:52:05 +09:00
}
2012-11-02 17:07:47 +09:00
f2fs: guarantee journalled quota data by checkpoint
For journalled quota mode, let checkpoint flush dirty dquot data and
quota file data to guarantee that all quota sysfiles are persisted in
the last checkpoint. This way, we avoid corrupting the quota sysfiles
on sudden power-off (SPO).
The implementation is as follows:
1. Add a global state SBI_QUOTA_NEED_FLUSH to indicate that there are
cached dquot metadata changes in the quota subsystem, so a later
checkpoint should:
a) flush dquot metadata into the quota file.
b) flush the quota file to storage to keep file usage consistent.
2. Add a global state SBI_QUOTA_NEED_REPAIR to indicate that a quota
operation failed due to -EIO or -ENOSPC, so later:
a) checkpoint will skip syncing dquot metadata.
b) CP_QUOTA_NEED_FSCK_FLAG will be set in the last cp pack as a
hint for fsck to repair.
3. Add a global state SBI_QUOTA_SKIP_FLUSH. During checkpoint, very
heavy quota data updates may cause a hung task in block_operation().
To avoid this, if the retry count exceeds a threshold, skip
flushing and retry in the next checkpoint().
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: avoid warnings and set fsck flag]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-20 20:05:00 +08:00
if ( __is_set_ckpt_flags ( F2FS_CKPT ( sbi ) , CP_QUOTA_NEED_FSCK_FLAG ) )
set_sbi_flag ( sbi , SBI_QUOTA_NEED_REPAIR ) ;
2019-01-24 17:48:38 -08:00
if ( __is_set_ckpt_flags ( F2FS_CKPT ( sbi ) , CP_DISABLED_QUICK_FLAG ) ) {
set_sbi_flag ( sbi , SBI_CP_DISABLED_QUICK ) ;
sbi - > interval_time [ DISABLE_TIME ] = DEF_DISABLE_QUICK_INTERVAL ;
}
2018-09-20 20:05:00 +08:00
2019-06-05 11:33:25 +08:00
if ( __is_set_ckpt_flags ( F2FS_CKPT ( sbi ) , CP_FSCK_FLAG ) )
set_sbi_flag ( sbi , SBI_NEED_FSCK ) ;
2016-10-06 19:02:05 -07:00
/* Initialize device list */
err = f2fs_scan_devices ( sbi ) ;
if ( err ) {
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " Failed to find devices " ) ;
2016-10-06 19:02:05 -07:00
goto free_devices ;
}
f2fs: support data compression
This patch adds support for compression in f2fs.
- A new term, cluster, is defined as the basic unit of compression. A file
can be divided logically into multiple clusters. One cluster contains
4 << n (n >= 0) logical pages, the compression size equals the cluster
size, and each cluster can be compressed or not.
- In the cluster metadata layout, one special flag indicates whether a
cluster is compressed or normal. For a compressed cluster, the following
metadata maps the cluster to [1, (4 << n) - 1] physical blocks, in which
f2fs stores the data, including the compress header and compressed data.
- To eliminate write amplification during overwrite, F2FS only supports
compression on write-once files: data can be compressed only when all
logical blocks in the file are valid and the cluster compress ratio is
lower than the specified threshold.
- To enable compression on regular inode, there are three ways:
* chattr +c file
* chattr +c dir; touch dir/file
* mount w/ -o compress_extension=ext; touch file.ext
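The cluster arithmetic described in the list above can be sketched as follows. The helper names are hypothetical and the sketch only illustrates the sizes stated in the text: a cluster of 4 << n logical pages, stored in at most (4 << n) - 1 physical blocks when compressed.

```c
#include <assert.h>
#include <stdint.h>

/* Logical pages per cluster for a given n (n >= 0). */
static uint32_t cluster_pages(unsigned int n)
{
	assert(n < 30);		/* keep 4 << n within 32 bits */
	return 4u << n;
}

/* Index of the cluster that contains a given logical page. */
static uint64_t cluster_index(uint64_t page_index, unsigned int n)
{
	return page_index / cluster_pages(n);
}

/* Upper bound on physical blocks used by a compressed cluster. */
static uint32_t max_compressed_blocks(unsigned int n)
{
	return cluster_pages(n) - 1;
}
```

With n = 0, a cluster covers 4 pages, page 17 falls into cluster 4, and a compressed cluster occupies between 1 and 3 physical blocks, so compression saves at least one block per cluster.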
Compress metadata layout:
[Dnode Structure]
+-----------------------------------------------+
| cluster 1 | cluster 2 | ......... | cluster N |
+-----------------------------------------------+
. . . .
. . . .
. Compressed Cluster . . Normal Cluster .
+----------+---------+---------+---------+ +---------+---------+---------+---------+
|compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 |
+----------+---------+---------+---------+ +---------+---------+---------+---------+
. .
. .
. .
+-------------+-------------+----------+----------------------------+
| data length | data chksum | reserved | compressed data |
+-------------+-------------+----------+----------------------------+
Changelog:
20190326:
- fix error handling of read_end_io().
- remove unneeded comments in f2fs_encrypt_one_page().
20190327:
- fix wrong use of f2fs_cluster_is_full() in f2fs_mpage_readpages().
- don't jump into loop directly to avoid uninitialized variables.
- add TODO tag in error path of f2fs_write_cache_pages().
20190328:
- fix wrong merge condition in f2fs_read_multi_pages().
- check compressed file in f2fs_post_read_required().
20190401
- allow overwrite on non-compressed cluster.
- check cluster meta before writing compressed data.
20190402
- don't preallocate blocks for compressed file.
- add lz4 compress algorithm
- process multiple post read works in one workqueue
Currently f2fs processes post read work in multiple workqueues,
which shows low performance due to the scheduling overhead of
multiple workqueues executing in order.
20190921
- compress: support buffered overwrite
C: compress cluster flag
V: valid block address
N: NEW_ADDR
One cluster contains 4 blocks
before overwrite    after overwrite
- VVVV -> CVNN
- CVNN -> VVVV
- CVNN -> CVNN
- CVNN -> CVVV
- CVVV -> CVNN
- CVVV -> CVVV
20191029
- add kconfig F2FS_FS_COMPRESSION to isolate compression related
codes, add kconfig F2FS_FS_{LZO,LZ4} to cover backend algorithm.
note that: will remove lzo backend if Jaegeuk agreed that too.
- update codes according to Eric's comments.
20191101
- apply fixes from Jaegeuk
20191113
- apply fixes from Jaegeuk
- split workqueue for fsverity
20191216
- apply fixes from Jaegeuk
20200117
- fix to avoid NULL pointer dereference
[Jaegeuk Kim]
- add tracepoint for f2fs_{,de}compress_pages()
- fix many bugs and add some compression stats
- fix overwrite/mmap bugs
- address 32bit build error, reported by Geert.
- bug fixes when handling errors and i_compressed_blocks
Reported-by: <noreply@ellerman.id.au>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-11-01 18:07:14 +08:00
err = f2fs_init_post_read_wq ( sbi ) ;
if ( err ) {
f2fs_err ( sbi , " Failed to initialize post read workqueue " ) ;
goto free_devices ;
}
2012-11-02 17:07:47 +09:00
sbi - > total_valid_node_count =
le32_to_cpu ( sbi - > ckpt - > valid_node_count ) ;
2016-05-16 11:42:32 -07:00
percpu_counter_set ( & sbi - > total_valid_inode_count ,
le32_to_cpu ( sbi - > ckpt - > valid_inode_count ) ) ;
2012-11-02 17:07:47 +09:00
sbi - > user_block_count = le64_to_cpu ( sbi - > ckpt - > user_block_count ) ;
sbi - > total_valid_block_count =
le64_to_cpu ( sbi - > ckpt - > valid_block_count ) ;
sbi - > last_valid_block_count = sbi - > total_valid_block_count ;
2017-06-26 16:24:41 +08:00
sbi - > reserved_blocks = 0 ;
2017-10-27 20:45:05 +08:00
sbi - > current_reserved_blocks = 0 ;
2017-12-27 15:05:52 -08:00
limit_reserve_root ( sbi ) ;
2020-05-15 17:20:50 -07:00
adjust_unusable_cap_perc ( sbi ) ;
2016-05-16 11:06:50 -07:00
2018-05-30 00:20:41 +08:00
f2fs_init_extent_cache_info ( sbi ) ;
f2fs: enable rb-tree extent cache
This patch enables the rb-tree based extent cache in f2fs.
When mounted with "-o extent_cache", f2fs tries to add recently accessed
page-block mappings into the rb-tree based extent cache as much as possible,
instead of the original single extent info cache.
This way, f2fs provides a more effective cache between the dnode page cache
and disk. It gives a high hit ratio with less memory when dnode pages are
reclaimed in low-memory environments.
Storage: Sandisk sd card 64g
1.append write file (offset: 0, size: 128M);
2.override write file (offset: 2M, size: 1M);
3.override write file (offset: 4M, size: 1M);
...
4.override write file (offset: 48M, size: 1M);
...
5.override write file (offset: 112M, size: 1M);
6.sync
7.echo 3 > /proc/sys/vm/drop_caches
8.read file (size:128M, unit: 4k, count: 32768)
(time dd if=/mnt/f2fs/128m bs=4k count=32768)
Extent Hit Ratio:
            before         patched
Hit Ratio   121 / 1071     1071 / 1071
Performance:
            before         patched
real        0m37.051s      0m35.556s
user        0m0.040s       0m0.026s
sys         0m2.990s       0m2.251s
Memory Cost:
            before         patched
Tree Count: 0              1 (size: 24 bytes)
Node Count: 0              45 (size: 1440 bytes)
v3:
o retest and given more details of test result.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-05 17:57:31 +08:00
f2fs: clean up symbol namespace
As Ted reported:
"Hi, I was looking at f2fs's sources recently, and I noticed that there
is a very large number of non-static symbols which don't have a f2fs
prefix. There's well over a hundred (see attached below).
As one example, in fs/f2fs/dir.c there is:
unsigned char get_de_type(struct f2fs_dir_entry *de)
This function is clearly only useful for f2fs, but it has a generic
name. This means that if any other file system tries to have the same
symbol name, there will be a symbol conflict and the kernel would not
successfully build. It also means that when someone is looking f2fs
sources, it's not at all obvious whether a function such as
read_data_page(), invalidate_blocks(), is a generic kernel function
found in the fs, mm, or block layers, or a f2fs specific function.
You might want to fix this at some point. Hopefully Kent's bcachefs
isn't similarly using genericly named functions, since that might
cause conflicts with f2fs's functions --- but just as this would be a
problem that we would rightly insist that Kent fix, this is something
that we should have rightly insisted that f2fs should have fixed
before it was integrated into the mainline kernel.
acquire_orphan_inode
add_ino_entry
add_orphan_inode
allocate_data_block
allocate_new_segments
alloc_nid
alloc_nid_done
alloc_nid_failed
available_free_memory
...."
This patch adds the "f2fs_" prefix to all non-static symbols in order to:
a) avoid conflicts with other generic kernel symbols;
b) indicate that a function is f2fs-specific rather than generic.
Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 00:20:41 +08:00
f2fs_init_ino_entry_info ( sbi ) ;
2012-11-02 17:07:47 +09:00
f2fs: fix to avoid broken of dnode block list
The f2fs recovery flow relies on the dnode block linked list: recovery of
an fsynced file depends on the previous dnodes in the list being persistent.
So during fsync() we should wait for writeback of all regular inodes' dnodes
before issuing the flush.
This way, we avoid the dnode block list being broken by out-of-order
IO submission in the IO scheduler or driver.
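The dependency described above can be sketched with a simplified user-space model. It is not the f2fs on-disk format or recovery code: struct model_dnode and model_recoverable_fsyncs() are hypothetical names, and the only assumption carried over from the message is that recovery walks the chain in order and stops at the first dnode that never reached storage.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one dnode block in the chain. */
struct model_dnode {
	bool on_disk;    /* the block reached stable storage before power loss */
	bool fsync_mark; /* the block closes an fsync()ed update */
};

/*
 * Walk the chain in log order and stop at the first block that is not on
 * disk, because the link to everything after it is lost. Returns how many
 * fsync-marked dnodes are still reachable.
 */
size_t model_recoverable_fsyncs(const struct model_dnode *chain, size_t n)
{
	size_t found = 0;

	for (size_t i = 0; i < n; i++) {
		if (!chain[i].on_disk)
			break;
		if (chain[i].fsync_mark)
			found++;
	}
	return found;
}
```

In this model, a chain whose last dnode is persisted with fsync_mark but whose earlier dnodes are not yields zero recoverable fsyncs, which is the out-of-order case the patch guards against.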
Sheng Yong helped test this patch:
Target:/data (f2fs, -)
64MB / 32768KB / 4KB / 8
1 / PERSIST / Index
Base:
SEQ-RD(MB/s) SEQ-WR(MB/s) RND-RD(IOPS) RND-WR(IOPS) Insert(TPS) Update(TPS) Delete(TPS)
1 867.82 204.15 41440.03 41370.54 680.8 1025.94 1031.08
2 871.87 205.87 41370.3 40275.2 791.14 1065.84 1101.7
3 866.52 205.69 41795.67 40596.16 694.69 1037.16 1031.48
Avg 868.7366667 205.2366667 41535.33333 40747.3 722.21 1042.98 1054.753333
After:
SEQ-RD(MB/s) SEQ-WR(MB/s) RND-RD(IOPS) RND-WR(IOPS) Insert(TPS) Update(TPS) Delete(TPS)
1 798.81 202.5 41143 40613.87 602.71 838.08 913.83
2 805.79 206.47 40297.2 41291.46 604.44 840.75 924.27
3 814.83 206.17 41209.57 40453.62 602.85 834.66 927.91
Avg 806.4766667 205.0466667 40883.25667 40786.31667 603.3333333 837.83 922.0033333
Patched/Original:
0.928332713 0.999074239 0.984300676 1.000957528 0.835398753 0.803303994 0.874141189
It looks like atomic write suffers a performance regression.
I suspect the culprit is that we force a wait for all dnodes to reach the
storage cache before we issue PREFLUSH+FUA.
BTW, will commit ("f2fs: don't need to wait for node writes for atomic write")
cause this problem: we lose the data of the last transaction after SPO
(sudden power-off), even if the atomic write returns no error:
- atomic_open();
- write() P1, P2, P3;
- atomic_commit();
- writeback data: P1, P2, P3;
- writeback node: N1, N2, N3; <--- If N1 and N2 are not written back but N3 (with fsync_mark)
is, SPOR won't find N3 because the node chain is broken, so the last transaction
is lost.
- preflush + fua;
- power-cut
If we don't wait for dnode writeback for atomic_write:
SEQ-RD(MB/s) SEQ-WR(MB/s) RND-RD(IOPS) RND-WR(IOPS) Insert(TPS) Update(TPS) Delete(TPS)
1 779.91 206.03 41621.5 40333.16 716.9 1038.21 1034.85
2 848.51 204.35 40082.44 39486.17 791.83 1119.96 1083.77
3 772.12 206.27 41335.25 41599.65 723.29 1055.07 971.92
Avg 800.18 205.55 41013.06333 40472.99333 744.0066667 1071.08 1030.18
Patched/Original:
0.92108464 1.001526693 0.987425886 0.993268102 1.030180511 1.026942031 0.976702294
SQLite's performance recovers.
Jaegeuk:
"Practically, I don't see db corruption because of this. We can excuse to lose
the last transaction."
Finally, we decided to keep the original semantics of the atomic write
interface: we don't wait for all dnode writeback before preflush+fua submission.
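The "Patched/Original" rows in the message above appear to be the ratio of the per-column averages of the three runs. A minimal sketch of that arithmetic, with hypothetical helper names (mean3, patched_over_base) that are not part of f2fs:

```c
/* Mean of three benchmark runs. */
double mean3(double a, double b, double c)
{
	return (a + b + c) / 3.0;
}

/* Ratio of the patched average to the base average for one column. */
double patched_over_base(const double base[3], const double patched[3])
{
	return mean3(patched[0], patched[1], patched[2]) /
	       mean3(base[0], base[1], base[2]);
}
```

Feeding in the SEQ-RD column (867.82, 871.87, 866.52 against 798.81, 805.79, 814.83) gives roughly 0.9283, and the Insert column gives roughly 0.8354, matching the first "Patched/Original" row.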
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-02 23:03:19 +08:00
f2fs_init_fsync_node_info ( sbi ) ;
f2fs: introduce checkpoint_merge mount option
We've added new mount options, "checkpoint_merge" and "nocheckpoint_merge".
"checkpoint_merge" creates a kernel daemon that merges concurrent checkpoint
requests as much as possible to eliminate redundant checkpoints. It also
avoids the sluggishness caused by a slow checkpoint operation
when the checkpoint is done in the context of a process in a cgroup with
a low i/o budget and few cpu shares. To make this work better, we set the
default i/o priority of the kernel daemon to "3", one level higher
than other kernel threads. The verification result below
shows the effect.
The basic idea has come from https://opensource.samsung.com.
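The merging idea can be illustrated with a single-threaded model. This is only a sketch under stated assumptions: struct model_ckpt_queue and the model_* functions are hypothetical and do not reflect how the patch actually queues or wakes requesters. The one property it tries to show is that several queued requests are satisfied by one checkpoint.

```c
/* Hypothetical bookkeeping for a checkpoint-merging daemon. */
struct model_ckpt_queue {
	unsigned int pending; /* requests queued since the last checkpoint */
	unsigned int issued;  /* checkpoints actually written */
	unsigned int served;  /* requests completed by those checkpoints */
};

/* A caller asks for a checkpoint; nothing is written yet. */
void model_ckpt_request(struct model_ckpt_queue *q)
{
	q->pending++;
}

/*
 * One pass of the daemon: take every pending request as a batch and
 * satisfy the whole batch with a single checkpoint. Returns the batch size.
 */
unsigned int model_ckpt_daemon_run(struct model_ckpt_queue *q)
{
	unsigned int batch = q->pending;

	if (batch == 0)
		return 0;
	q->pending = 0;
	q->issued++;
	q->served += batch;
	return batch;
}
```

Three requests followed by one daemon pass leave issued at 1 and served at 3, which is the redundancy the option is meant to remove.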
[Verification]
Android Pixel Device(ARM64, 7GB RAM, 256GB UFS)
Create two I/O cgroups (fg w/ weight 100, bg w/ weight 20)
Set "strict_guarantees" to "1" in BFQ tunables
In "fg" cgroup,
- thread A => trigger 1000 checkpoint operations
"for i in `seq 1 1000`; do touch test_dir1/file; fsync test_dir1;
done"
- thread B => generating async. I/O
"fio --rw=write --numjobs=1 --bs=128k --runtime=3600 --time_based=1
--filename=test_img --name=test"
In "bg" cgroup,
- thread C => trigger repeated checkpoint operations
"echo $$ > /dev/blkio/bg/tasks; while true; do touch test_dir2/file;
fsync test_dir2; done"
We've measured thread A's execution time.
[ w/o patch ]
Elapsed Time: Avg. 68 seconds
[ w/ patch ]
Elapsed Time: Avg. 48 seconds
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
[Jaegeuk Kim: fix the return value in f2fs_start_ckpt_thread, reported by Dan]
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-01-19 09:00:42 +09:00
/* setup checkpoint request control and start checkpoint issue thread */
f2fs_init_ckpt_req_control ( sbi ) ;
2021-03-17 17:56:03 +08:00
if (!f2fs_readonly(sb) && !test_opt(sbi, DISABLE_CHECKPOINT) &&
2021-01-19 09:00:42 +09:00
test_opt(sbi, MERGE_CHECKPOINT)) {
err = f2fs_start_ckpt_thread(sbi);
if (err) {
f2fs_err(sbi,
"Failed to start F2FS issue_checkpoint_thread (%d)",
err);
goto stop_ckpt_thread;
}
}
2012-11-02 17:07:47 +09:00
/* setup f2fs internal modules */
2018-05-30 00:20:41 +08:00
err = f2fs_build_segment_manager ( sbi ) ;
2012-12-30 14:52:05 +09:00
if ( err ) {
2019-06-18 17:48:42 +08:00
f2fs_err(sbi, "Failed to initialize F2FS segment manager (%d)",
err);
2012-11-02 17:07:47 +09:00
goto free_sm ;
2012-12-30 14:52:05 +09:00
}
2018-05-30 00:20:41 +08:00
err = f2fs_build_node_manager ( sbi ) ;
2012-12-30 14:52:05 +09:00
if ( err ) {
2019-06-18 17:48:42 +08:00
f2fs_err(sbi, "Failed to initialize F2FS node manager (%d)",
err);
2012-11-02 17:07:47 +09:00
goto free_nm ;
2012-12-30 14:52:05 +09:00
}
2012-11-02 17:07:47 +09:00
f2fs: fix to reserve space for IO align feature
https://bugzilla.kernel.org/show_bug.cgi?id=204137
With the script below, we hit a panic during new segment allocation:
DISK=bingo.img
MOUNT_DIR=/mnt/f2fs
dd if=/dev/zero of=$DISK bs=1M count=105
mkfs.f2fs -a 1 -o 19 -t 1 -z 1 -f -q $DISK
mount -t f2fs $DISK $MOUNT_DIR -o "noinline_dentry,flush_merge,noextent_cache,mode=lfs,io_bits=7,fsync_mode=strict"
for (( i = 0; i < 4096; i++ )); do
name=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 10`
mkdir $MOUNT_DIR/$name
done
umount $MOUNT_DIR
rm $DISK
--- Core dump ---
Call Trace:
allocate_segment_by_default+0x9d/0x100 [f2fs]
f2fs_allocate_data_block+0x3c0/0x5c0 [f2fs]
do_write_page+0x62/0x110 [f2fs]
f2fs_outplace_write_data+0x43/0xc0 [f2fs]
f2fs_do_write_data_page+0x386/0x560 [f2fs]
__write_data_page+0x706/0x850 [f2fs]
f2fs_write_cache_pages+0x267/0x6a0 [f2fs]
f2fs_write_data_pages+0x19c/0x2e0 [f2fs]
do_writepages+0x1c/0x70
__filemap_fdatawrite_range+0xaa/0xe0
filemap_fdatawrite+0x1f/0x30
f2fs_sync_dirty_inodes+0x74/0x1f0 [f2fs]
block_operations+0xdc/0x350 [f2fs]
f2fs_write_checkpoint+0x104/0x1150 [f2fs]
f2fs_sync_fs+0xa2/0x120 [f2fs]
f2fs_balance_fs_bg+0x33c/0x390 [f2fs]
f2fs_write_node_pages+0x4c/0x1f0 [f2fs]
do_writepages+0x1c/0x70
__writeback_single_inode+0x45/0x320
writeback_sb_inodes+0x273/0x5c0
wb_writeback+0xff/0x2e0
wb_workfn+0xa1/0x370
process_one_work+0x138/0x350
worker_thread+0x4d/0x3d0
kthread+0x109/0x140
ret_from_fork+0x25/0x30
The root cause is that, with the IO alignment feature enabled, in the worst
case we need F2FS_IO_SIZE() free blocks for a single 4k write,
because the IO alignment feature fills in dummy pages to keep the IO
aligned.
So we easily run out of free segments during writeback of non-inline
directories' data, even during foreground GC.
To fix this issue, I propose to reserve additional free
space for the IO alignment feature to handle the worst case of free space
usage during FGGC.
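The worst case described above can be put into numbers with a small user-space sketch. It assumes that the aligned IO size is 1 << io_bits blocks of 4 KiB each; the names model_io_size_blocks() and model_worst_case_bytes() are hypothetical and are not the helpers the fix itself adds.

```c
#include <stdint.h>

/* Assumed f2fs block size in bytes. */
#define MODEL_BLOCK_SIZE 4096ull

/* Blocks consumed by one aligned IO, assuming 1 << io_bits blocks. */
uint64_t model_io_size_blocks(unsigned int io_bits)
{
	return 1ull << io_bits;
}

/*
 * Upper bound on the space consumed if each of nr_writes single-block
 * writes is padded out to a full aligned IO with dummy pages.
 */
uint64_t model_worst_case_bytes(unsigned int io_bits, uint64_t nr_writes)
{
	return nr_writes * model_io_size_blocks(io_bits) * MODEL_BLOCK_SIZE;
}
```

With io_bits=7 as in the reproducer, one 4k write can occupy up to 128 blocks (512 KiB). If all 4096 directory blocks were padded that way, the bound would be 2 GiB against a 105 MiB image; this is an upper bound for illustration, not a measurement of what the script actually writes.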
Fixes: 0a595ebaaa6b ("f2fs: support IO alignment for DATA and NODE writes")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-11 21:27:36 +08:00
err = adjust_reserved_segment ( sbi ) ;
if ( err )
goto free_nm ;
2016-01-27 09:57:30 +08:00
/* For write statistics */
2020-11-27 21:20:06 +08:00
sbi->sectors_written_start = f2fs_get_sectors_written(sbi);
2016-01-27 09:57:30 +08:00
/* Read accumulated write IO statistics if exists */
seg_i = CURSEG_I(sbi, CURSEG_HOT_NODE);
if (__exist_node_summaries(sbi))
sbi->kbytes_written =
2016-03-29 18:00:15 +08:00
le64_to_cpu(seg_i->journal->info.kbytes_written);
2016-01-27 09:57:30 +08:00
2018-05-30 00:20:41 +08:00
f2fs_build_gc_manager ( sbi ) ;
2012-11-02 17:07:47 +09:00
2018-12-26 11:20:29 +05:30
err = f2fs_build_stats ( sbi ) ;
if ( err )
goto free_nm ;
2012-11-02 17:07:47 +09:00
/* get an inode for node space */
sbi->node_inode = f2fs_iget(sb, F2FS_NODE_INO(sbi));
if (IS_ERR(sbi->node_inode)) {
2019-06-18 17:48:42 +08:00
f2fs_err(sbi, "Failed to read node inode");
2012-11-02 17:07:47 +09:00
err = PTR_ERR(sbi->node_inode);
2018-12-26 11:20:29 +05:30
goto free_stats ;
2012-11-02 17:07:47 +09:00
}
/* read root inode and dentry */
root = f2fs_iget ( sb , F2FS_ROOT_INO ( sbi ) ) ;
if ( IS_ERR ( root ) ) {
2019-06-18 17:48:42 +08:00
f2fs_err(sbi, "Failed to read root inode");
2012-11-02 17:07:47 +09:00
err = PTR_ERR ( root ) ;
2018-12-26 11:20:29 +05:30
goto free_node_inode ;
2012-11-02 17:07:47 +09:00
}
f2fs: fix to do sanity check with inline flags
https://bugzilla.kernel.org/show_bug.cgi?id=200221
- Overview
BUG() in clear_inode() when mounting and un-mounting a corrupted f2fs image
- Reproduce
- Kernel message
[ 538.601448] F2FS-fs (loop0): Invalid segment/section count (31, 24 x 1376257)
[ 538.601458] F2FS-fs (loop0): Can't find valid F2FS filesystem in 2th superblock
[ 538.724091] F2FS-fs (loop0): Try to recover 2th superblock, ret: 0
[ 538.724102] F2FS-fs (loop0): Mounted with checkpoint version = 2
[ 540.970834] ------------[ cut here ]------------
[ 540.970838] kernel BUG at fs/inode.c:512!
[ 540.971750] invalid opcode: 0000 [#1] SMP KASAN PTI
[ 540.972755] CPU: 1 PID: 1305 Comm: umount Not tainted 4.18.0-rc1+ #4
[ 540.974034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 540.982913] RIP: 0010:clear_inode+0xc0/0xd0
[ 540.983774] Code: 8d a3 30 01 00 00 4c 89 e7 e8 1c ec f8 ff 48 8b 83 30 01 00 00 49 39 c4 75 1a 48 c7 83 a0 00 00 00 60 00 00 00 5b 41 5c 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 1f 40 00 66 66 66 66 90 55
[ 540.987570] RSP: 0018:ffff8801e34a7b70 EFLAGS: 00010002
[ 540.988636] RAX: 0000000000000000 RBX: ffff8801e9b744e8 RCX: ffffffffb840eb3a
[ 540.990063] RDX: dffffc0000000000 RSI: 0000000000000004 RDI: ffff8801e9b746b8
[ 540.991499] RBP: ffff8801e34a7b80 R08: ffffed003d36e8ce R09: ffffed003d36e8ce
[ 540.992923] R10: 0000000000000001 R11: ffffed003d36e8cd R12: ffff8801e9b74668
[ 540.994360] R13: ffff8801e9b74760 R14: ffff8801e9b74528 R15: ffff8801e9b74530
[ 540.995786] FS: 00007f4662bdf840(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[ 540.997403] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 540.998571] CR2: 000000000175c568 CR3: 00000001dcfe6000 CR4: 00000000000006e0
[ 541.000015] Call Trace:
[ 541.000554] f2fs_evict_inode+0x253/0x630
[ 541.001381] evict+0x16f/0x290
[ 541.002015] iput+0x280/0x300
[ 541.002654] dentry_unlink_inode+0x165/0x1e0
[ 541.003528] __dentry_kill+0x16a/0x260
[ 541.004300] dentry_kill+0x70/0x250
[ 541.005018] dput+0x154/0x1d0
[ 541.005635] do_one_tree+0x34/0x40
[ 541.006354] shrink_dcache_for_umount+0x3f/0xa0
[ 541.007285] generic_shutdown_super+0x43/0x1c0
[ 541.008192] kill_block_super+0x52/0x80
[ 541.008978] kill_f2fs_super+0x62/0x70
[ 541.009750] deactivate_locked_super+0x6f/0xa0
[ 541.010664] deactivate_super+0x5e/0x80
[ 541.011450] cleanup_mnt+0x61/0xa0
[ 541.012151] __cleanup_mnt+0x12/0x20
[ 541.012893] task_work_run+0xc8/0xf0
[ 541.013635] exit_to_usermode_loop+0x125/0x130
[ 541.014555] do_syscall_64+0x138/0x170
[ 541.015340] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 541.016375] RIP: 0033:0x7f46624bf487
[ 541.017104] Code: 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e1 c9 2b 00 f7 d8 64 89 01 48
[ 541.020923] RSP: 002b:00007fff5e12e9a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 541.022452] RAX: 0000000000000000 RBX: 0000000001753030 RCX: 00007f46624bf487
[ 541.023885] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000000000175a1e0
[ 541.025318] RBP: 000000000175a1e0 R08: 0000000000000000 R09: 0000000000000014
[ 541.026755] R10: 00000000000006b2 R11: 0000000000000246 R12: 00007f46629c883c
[ 541.028186] R13: 0000000000000000 R14: 0000000001753210 R15: 00007fff5e12ec30
[ 541.029626] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[ 541.039445] ---[ end trace 4ce02f25ff7d3df5 ]---
[ 541.040392] RIP: 0010:clear_inode+0xc0/0xd0
[ 541.041240] Code: 8d a3 30 01 00 00 4c 89 e7 e8 1c ec f8 ff 48 8b 83 30 01 00 00 49 39 c4 75 1a 48 c7 83 a0 00 00 00 60 00 00 00 5b 41 5c 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 1f 40 00 66 66 66 66 90 55
[ 541.045042] RSP: 0018:ffff8801e34a7b70 EFLAGS: 00010002
[ 541.046099] RAX: 0000000000000000 RBX: ffff8801e9b744e8 RCX: ffffffffb840eb3a
[ 541.047537] RDX: dffffc0000000000 RSI: 0000000000000004 RDI: ffff8801e9b746b8
[ 541.048965] RBP: ffff8801e34a7b80 R08: ffffed003d36e8ce R09: ffffed003d36e8ce
[ 541.050402] R10: 0000000000000001 R11: ffffed003d36e8cd R12: ffff8801e9b74668
[ 541.051832] R13: ffff8801e9b74760 R14: ffff8801e9b74528 R15: ffff8801e9b74530
[ 541.053263] FS: 00007f4662bdf840(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[ 541.054891] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 541.056039] CR2: 000000000175c568 CR3: 00000001dcfe6000 CR4: 00000000000006e0
[ 541.058506] ==================================================================
[ 541.059991] BUG: KASAN: stack-out-of-bounds in update_stack_state+0x38c/0x3e0
[ 541.061513] Read of size 8 at addr ffff8801e34a7970 by task umount/1305
[ 541.063302] CPU: 1 PID: 1305 Comm: umount Tainted: G D 4.18.0-rc1+ #4
[ 541.064838] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 541.066778] Call Trace:
[ 541.067294] dump_stack+0x7b/0xb5
[ 541.067986] print_address_description+0x70/0x290
[ 541.068941] kasan_report+0x291/0x390
[ 541.069692] ? update_stack_state+0x38c/0x3e0
[ 541.070598] __asan_load8+0x54/0x90
[ 541.071315] update_stack_state+0x38c/0x3e0
[ 541.072172] ? __read_once_size_nocheck.constprop.7+0x20/0x20
[ 541.073340] ? vprintk_func+0x27/0x60
[ 541.074096] ? printk+0xa3/0xd3
[ 541.074762] ? __save_stack_trace+0x5e/0x100
[ 541.075634] unwind_next_frame.part.5+0x18e/0x490
[ 541.076594] ? unwind_dump+0x290/0x290
[ 541.077368] ? __show_regs+0x2c4/0x330
[ 541.078142] __unwind_start+0x106/0x190
[ 541.085422] __save_stack_trace+0x5e/0x100
[ 541.086268] ? __save_stack_trace+0x5e/0x100
[ 541.087161] ? unlink_anon_vmas+0xba/0x2c0
[ 541.087997] save_stack_trace+0x1f/0x30
[ 541.088782] save_stack+0x46/0xd0
[ 541.089475] ? __alloc_pages_slowpath+0x1420/0x1420
[ 541.090477] ? flush_tlb_mm_range+0x15e/0x220
[ 541.091364] ? __dec_node_state+0x24/0xb0
[ 541.092180] ? lock_page_memcg+0x85/0xf0
[ 541.092979] ? unlock_page_memcg+0x16/0x80
[ 541.093812] ? page_remove_rmap+0x198/0x520
[ 541.094674] ? mark_page_accessed+0x133/0x200
[ 541.095559] ? _cond_resched+0x1a/0x50
[ 541.096326] ? unmap_page_range+0xcd4/0xe50
[ 541.097179] ? rb_next+0x58/0x80
[ 541.097845] ? rb_next+0x58/0x80
[ 541.098518] __kasan_slab_free+0x13c/0x1a0
[ 541.099352] ? unlink_anon_vmas+0xba/0x2c0
[ 541.100184] kasan_slab_free+0xe/0x10
[ 541.100934] kmem_cache_free+0x89/0x1e0
[ 541.101724] unlink_anon_vmas+0xba/0x2c0
[ 541.102534] free_pgtables+0x101/0x1b0
[ 541.103299] exit_mmap+0x146/0x2a0
[ 541.103996] ? __ia32_sys_munmap+0x50/0x50
[ 541.104829] ? kasan_check_read+0x11/0x20
[ 541.105649] ? mm_update_next_owner+0x322/0x380
[ 541.106578] mmput+0x8b/0x1d0
[ 541.107191] do_exit+0x43a/0x1390
[ 541.107876] ? mm_update_next_owner+0x380/0x380
[ 541.108791] ? deactivate_super+0x5e/0x80
[ 541.109610] ? cleanup_mnt+0x61/0xa0
[ 541.110351] ? __cleanup_mnt+0x12/0x20
[ 541.111115] ? task_work_run+0xc8/0xf0
[ 541.111879] ? exit_to_usermode_loop+0x125/0x130
[ 541.112817] rewind_stack_do_exit+0x17/0x20
[ 541.113666] RIP: 0033:0x7f46624bf487
[ 541.114404] Code: Bad RIP value.
[ 541.115094] RSP: 002b:00007fff5e12e9a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 541.116605] RAX: 0000000000000000 RBX: 0000000001753030 RCX: 00007f46624bf487
[ 541.118034] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000000000175a1e0
[ 541.119472] RBP: 000000000175a1e0 R08: 0000000000000000 R09: 0000000000000014
[ 541.120890] R10: 00000000000006b2 R11: 0000000000000246 R12: 00007f46629c883c
[ 541.122321] R13: 0000000000000000 R14: 0000000001753210 R15: 00007fff5e12ec30
[ 541.124061] The buggy address belongs to the page:
[ 541.125042] page:ffffea00078d29c0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[ 541.126651] flags: 0x2ffff0000000000()
[ 541.127418] raw: 02ffff0000000000 dead000000000100 dead000000000200 0000000000000000
[ 541.128963] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 541.130516] page dumped because: kasan: bad access detected
[ 541.131954] Memory state around the buggy address:
[ 541.132924] ffff8801e34a7800: 00 f1 f1 f1 f1 00 f4 f4 f4 f3 f3 f3 f3 00 00 00
[ 541.134378] ffff8801e34a7880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 541.135814] >ffff8801e34a7900: 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1
[ 541.137253] ^
[ 541.138637] ffff8801e34a7980: f1 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 541.140075] ffff8801e34a7a00: 00 00 00 00 00 00 00 00 f3 00 00 00 00 00 00 00
[ 541.141509] ==================================================================
- Location
https://elixir.bootlin.com/linux/v4.18-rc1/source/fs/inode.c#L512
BUG_ON(inode->i_data.nrpages);
The root cause is that the root directory inode is corrupted: it has both
the inline_data and inline_dentry flags set, and its nlink is zero. So in
->evict(), after dropping all page cache, it grabs page #0 for inline
data truncation, resulting in a panic in the later clear_inode(), which
checks the inode->i_data.nrpages value.
This patch adds an inline flags check in sanity_check_inode and, in addition,
a sanity check of the root inode's nlink.
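The two checks described above can be sketched in user-space C. This is a hypothetical model rather than the code in sanity_check_inode: the flag values and the model_* names are illustrative assumptions, and only the two conditions named in the message are encoded.

```c
#include <stdbool.h>

/* Illustrative flag bits; the real f2fs flag values may differ. */
#define MODEL_INLINE_DATA   0x1u
#define MODEL_INLINE_DENTRY 0x2u

/* An inode must not carry both inline_data and inline_dentry. */
bool model_inline_flags_sane(unsigned int flags)
{
	bool has_data = (flags & MODEL_INLINE_DATA) != 0;
	bool has_dentry = (flags & MODEL_INLINE_DENTRY) != 0;

	return !(has_data && has_dentry);
}

/* The root inode additionally needs a non-zero link count. */
bool model_root_inode_sane(unsigned int flags, unsigned int nlink)
{
	return model_inline_flags_sane(flags) && nlink != 0;
}
```

The corrupted image from the report corresponds to both flags set with nlink equal to zero, which this model rejects on either condition alone.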
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-06-29 00:19:25 +08:00
if (!S_ISDIR(root->i_mode) || !root->i_blocks ||
!root->i_size || !root->i_nlink) {
2014-07-25 12:55:09 +08:00
iput ( root ) ;
2013-11-28 15:43:43 +08:00
err = - EINVAL ;
2018-12-26 11:20:29 +05:30
goto free_node_inode ;
2013-11-28 15:43:43 +08:00
}
2012-11-02 17:07:47 +09:00
sb->s_root = d_make_root(root); /* allocate root dentry */
if (!sb->s_root) {
err = -ENOMEM;
2019-01-23 15:49:44 +08:00
goto free_node_inode ;
2012-11-02 17:07:47 +09:00
}
2021-05-20 19:51:50 +08:00
err = f2fs_init_compress_inode ( sbi ) ;
2013-08-04 23:09:40 +09:00
if ( err )
2017-06-14 17:39:46 +08:00
goto free_root_inode ;
2013-08-04 23:09:40 +09:00
2021-05-20 19:51:50 +08:00
err = f2fs_register_sysfs ( sbi ) ;
if ( err )
goto free_compress_inode ;
2017-10-06 09:14:28 -07:00
# ifdef CONFIG_QUOTA
2018-07-26 19:24:25 +08:00
/* Enable quota usage during mount */
2018-10-24 18:34:26 +08:00
if (f2fs_sb_has_quota_ino(sbi) && !f2fs_readonly(sb)) {
2017-10-06 09:14:28 -07:00
err = f2fs_enable_quotas ( sb ) ;
2018-10-02 17:20:58 -07:00
if ( err )
2019-06-18 17:48:42 +08:00
f2fs_err(sbi, "Cannot turn on quotas: error %d", err);
2017-10-06 09:14:28 -07:00
}
# endif
2020-02-27 19:30:05 +08:00
/* if there are any orphan inodes, free them */
2018-05-30 00:20:41 +08:00
err = f2fs_recover_orphan_inodes ( sbi ) ;
2017-08-08 10:54:31 +08:00
if ( err )
2017-10-06 09:14:28 -07:00
goto free_meta ;
2017-08-08 10:54:31 +08:00
2018-08-20 19:21:43 -07:00
if ( unlikely ( is_set_ckpt_flags ( sbi , CP_DISABLED_FLAG ) ) )
2019-02-19 16:23:53 +08:00
goto reset_checkpoint ;
2018-08-20 19:21:43 -07:00
2014-02-19 18:23:32 +09:00
/* recover fsynced data */
2020-02-14 17:45:11 +08:00
if (!test_opt(sbi, DISABLE_ROLL_FORWARD) &&
!test_opt(sbi, NORECOVERY)) {
2015-01-23 19:16:59 -08:00
/*
 * mount should fail when the device is read-only and the
 * previous checkpoint was not done by a clean system shutdown.
 */
2019-04-22 20:22:37 +08:00
if ( f2fs_hw_is_readonly ( sbi ) ) {
2021-03-31 11:16:32 +08:00
if (!is_set_ckpt_flags(sbi, CP_UMOUNT_FLAG)) {
err = f2fs_recover_fsync_data(sbi, true);
if (err > 0) {
err = -EROFS;
f2fs_err(sbi, "Need to recover fsync data, but "
"write access unavailable, please try "
"mount w/ disable_roll_forward or norecovery");
}
if (err < 0)
goto free_meta;
}
f2fs_info(sbi, "write access unavailable, skipping recovery");
2019-04-22 20:22:37 +08:00
goto reset_checkpoint ;
2015-01-23 19:16:59 -08:00
}
2015-03-16 21:08:44 +08:00
if ( need_fsck )
set_sbi_flag ( sbi , SBI_NEED_FSCK ) ;
2019-02-19 16:23:53 +08:00
if ( skip_recovery )
goto reset_checkpoint ;
2016-09-19 17:55:10 -07:00
f2fs: clean up symbol namespace
As Ted reported:
"Hi, I was looking at f2fs's sources recently, and I noticed that there
is a very large number of non-static symbols which don't have a f2fs
prefix. There's well over a hundred (see attached below).
As one example, in fs/f2fs/dir.c there is:
unsigned char get_de_type(struct f2fs_dir_entry *de)
This function is clearly only useful for f2fs, but it has a generic
name. This means that if any other file system tries to have the same
symbol name, there will be a symbol conflict and the kernel would not
successfully build. It also means that when someone is looking f2fs
sources, it's not at all obvious whether a function such as
read_data_page(), invalidate_blocks(), is a generic kernel function
found in the fs, mm, or block layers, or a f2fs specific function.
You might want to fix this at some point. Hopefully Kent's bcachefs
isn't similarly using genericly named functions, since that might
cause conflicts with f2fs's functions --- but just as this would be a
problem that we would rightly insist that Kent fix, this is something
that we should have rightly insisted that f2fs should have fixed
before it was integrated into the mainline kernel.
acquire_orphan_inode
add_ino_entry
add_orphan_inode
allocate_data_block
allocate_new_segments
alloc_nid
alloc_nid_done
alloc_nid_failed
available_free_memory
...."
This patch adds an "f2fs_" prefix to all non-static symbols in order to:
a) avoid conflicts with other generic kernel symbols;
b) indicate that a function is f2fs specific rather than
generic;
Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 00:20:41 +08:00
err = f2fs_recover_fsync_data ( sbi , false ) ;
2016-03-23 16:12:58 -07:00
if ( err < 0 ) {
2019-02-19 16:23:53 +08:00
if ( err ! = - ENOMEM )
skip_recovery = true ;
2015-03-16 21:08:44 +08:00
need_fsck = true ;
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " Cannot recover all fsync data errno=%d " ,
err ) ;
2017-08-08 10:54:31 +08:00
goto free_meta ;
2014-08-08 15:37:41 -07:00
}
2016-03-23 16:12:58 -07:00
} else {
2018-05-30 00:20:41 +08:00
err = f2fs_recover_fsync_data ( sbi , true ) ;
2016-03-23 16:12:58 -07:00
if ( ! f2fs_readonly ( sb ) & & err > 0 ) {
err = - EINVAL ;
2019-06-18 17:48:42 +08:00
f2fs_err ( sbi , " Need to recover fsync data " ) ;
2017-10-06 09:14:28 -07:00
goto free_meta ;
2016-03-23 16:12:58 -07:00
}
2014-02-19 18:23:32 +09:00
}
2019-12-09 19:44:45 +09:00
/*
* If the f2fs is not readonly and fsync data recovery succeeds ,
* check zoned block devices ' write pointer consistency .
*/
if ( ! err & & ! f2fs_readonly ( sb ) & & f2fs_sb_has_blkzoned ( sbi ) ) {
err = f2fs_check_write_pointer ( sbi ) ;
if ( err )
goto free_meta ;
}
2019-02-19 16:23:53 +08:00
reset_checkpoint :
f2fs: support age threshold based garbage collection
There are several issues in the current background GC algorithm:
- The valid block count is one of the key factors in the cost overhead
calculation, so if a segment has few valid blocks, the CB algorithm
will still choose it as a victim even if it is young or hot, which is
not appropriate.
- GCed data/node blocks go to existing logs regardless of whether the
data already there has the same update frequency, so hot and cold data
may be mixed again.
- The GC allocator mainly uses LFS type segments, so it consumes free
segments more quickly.
This patch introduces a new algorithm named age threshold based
garbage collection to solve the above issues. There are three main
steps:
1. select a source victim:
- set an age threshold, and select candidates based on the threshold:
e.g.
0 means youngest, 100 means oldest; if we set the age threshold to 80,
then select dirty segments whose age is in the range [80, 100] as
candidates;
- set a candidate_ratio threshold, and select candidates based on the
ratio, so that we can shrink the candidates to the oldest segments;
- select the target segment with the fewest valid blocks in order to
migrate blocks with minimum cost;
2. select a target victim:
- select candidates based on the age threshold;
- set a candidate_radius threshold, and search for candidates whose age
is around the source victim's; the search radius should be less than
the radius threshold;
- select the target segment with the most valid blocks in order to avoid
migrating the current target segment.
3. merge valid blocks from the source victim into the target victim with
the SSR allocator.
Test steps:
- create 160 dirty segments:
* half of them have 128 valid blocks per segment
* the rest have 384 valid blocks per segment
- run background GC
Benefit: GC count and block movement count both decrease noticeably:
- Before:
- Valid: 86
- Dirty: 1
- Prefree: 11
- Free: 6001 (6001)
GC calls: 162 (BG: 220)
- data segments : 160 (160)
- node segments : 2 (2)
Try to move 41454 blocks (BG: 41454)
- data blocks : 40960 (40960)
- node blocks : 494 (494)
IPU: 0 blocks
SSR: 0 blocks in 0 segments
LFS: 41364 blocks in 81 segments
- After:
- Valid: 87
- Dirty: 0
- Prefree: 4
- Free: 6008 (6008)
GC calls: 75 (BG: 76)
- data segments : 74 (74)
- node segments : 1 (1)
Try to move 12813 blocks (BG: 12813)
- data blocks : 12544 (12544)
- node blocks : 269 (269)
IPU: 0 blocks
SSR: 12032 blocks in 77 segments
LFS: 855 blocks in 2 segments
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix a bug along with pinfile in-mem segment & clean up]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-08-04 21:14:49 +08:00
f2fs_init_inmem_curseg ( sbi ) ;
2018-05-30 00:20:41 +08:00
/* f2fs_recover_fsync_data() cleared this already */
2015-08-11 12:45:39 -07:00
clear_sbi_flag ( sbi , SBI_POR_DOING ) ;
2013-08-04 23:09:40 +09:00
2018-08-20 19:21:43 -07:00
if ( test_opt ( sbi , DISABLE_CHECKPOINT ) ) {
err = f2fs_disable_checkpoint ( sbi ) ;
if ( err )
2019-01-22 14:04:33 -08:00
goto sync_free_meta ;
2018-08-20 19:21:43 -07:00
} else if ( is_set_ckpt_flags ( sbi , CP_DISABLED_FLAG ) ) {
f2fs_enable_checkpoint ( sbi ) ;
}
2014-02-19 18:23:32 +09:00
/*
* If filesystem is not mounted as read - only then
* do start the gc_thread .
*/
2021-03-27 17:57:06 +08:00
if ( ( F2FS_OPTION ( sbi ) . bggc_mode ! = BGGC_MODE_OFF | |
test_opt ( sbi , GC_MERGE ) ) & & ! f2fs_readonly ( sb ) ) {
2014-02-19 18:23:32 +09:00
/* After POR, we can run background GC thread.*/
2018-05-30 00:20:41 +08:00
err = f2fs_start_gc_thread ( sbi ) ;
2014-02-19 18:23:32 +09:00
if ( err )
2019-01-22 14:04:33 -08:00
goto sync_free_meta ;
2014-02-19 18:23:32 +09:00
}
2018-12-13 18:38:33 -08:00
kvfree ( options ) ;
2015-05-21 14:42:53 +08:00
/* recover broken superblock */
2016-03-23 10:42:01 -07:00
if ( recovery ) {
2016-02-22 18:33:20 +08:00
err = f2fs_commit_super ( sbi , true ) ;
2019-06-18 17:48:42 +08:00
f2fs_info ( sbi , " Try to recover %dth superblock, ret: %d " ,
sbi - > valid_super_block ? 1 : 2 , err ) ;
2015-05-21 14:42:53 +08:00
}
2017-11-30 19:28:20 +08:00
f2fs_join_shrinker ( sbi ) ;
2018-02-22 14:09:30 -08:00
f2fs_tuning_parameters ( sbi ) ;
2019-06-18 17:48:42 +08:00
f2fs_notice ( sbi , " Mounted with checkpoint version = %llx " ,
cur_cp_version ( F2FS_CKPT ( sbi ) ) ) ;
2016-01-08 15:51:50 -08:00
f2fs_update_time ( sbi , CP_TIME ) ;
2016-01-08 16:57:48 -08:00
f2fs_update_time ( sbi , REQ_TIME ) ;
2019-01-24 17:48:38 -08:00
clear_sbi_flag ( sbi , SBI_CP_DISABLED_QUICK ) ;
2012-11-02 17:07:47 +09:00
return 0 ;
2014-02-19 18:23:32 +09:00
2019-01-22 14:04:33 -08:00
sync_free_meta :
/* safe to flush all the data */
sync_filesystem ( sbi - > sb ) ;
2019-02-19 16:23:53 +08:00
retry_cnt = 0 ;
2019-01-22 14:04:33 -08:00
2017-08-08 10:54:31 +08:00
free_meta :
2017-10-06 09:14:28 -07:00
# ifdef CONFIG_QUOTA
2018-10-12 18:49:26 +08:00
f2fs_truncate_quota_inode_pages ( sb ) ;
2018-10-24 18:34:26 +08:00
if ( f2fs_sb_has_quota_ino ( sbi ) & & ! f2fs_readonly ( sb ) )
2017-10-06 09:14:28 -07:00
f2fs_quota_off_umount ( sbi - > sb ) ;
# endif
2017-08-08 10:54:31 +08:00
/*
2018-05-30 00:20:41 +08:00
* Some dirty meta pages can be produced by f2fs_recover_orphan_inodes ( )
2017-08-08 10:54:31 +08:00
* failed by EIO . Then , iput ( node_inode ) can trigger balance_fs_bg ( )
2018-05-30 00:20:41 +08:00
* followed by f2fs_write_checkpoint ( ) through f2fs_write_node_pages ( ) , which
* falls into an infinite loop in f2fs_sync_meta_pages ( ) .
2017-08-08 10:54:31 +08:00
*/
truncate_inode_pages_final ( META_MAPPING ( sbi ) ) ;
2019-01-22 14:04:33 -08:00
/* evict some inodes being cached by GC */
evict_inodes ( sb ) ;
2017-07-26 11:24:13 -07:00
f2fs_unregister_sysfs ( sbi ) ;
2021-05-20 19:51:50 +08:00
free_compress_inode :
f2fs_destroy_compress_inode ( sbi ) ;
2012-11-02 17:07:47 +09:00
free_root_inode :
dput ( sb - > s_root ) ;
sb - > s_root = NULL ;
free_node_inode :
2018-05-30 00:20:41 +08:00
f2fs_release_ino_entry ( sbi , true ) ;
2017-11-30 19:28:20 +08:00
truncate_inode_pages_final ( NODE_MAPPING ( sbi ) ) ;
2012-11-02 17:07:47 +09:00
iput ( sbi - > node_inode ) ;
2019-01-01 00:11:30 -08:00
sbi - > node_inode = NULL ;
2018-12-26 11:20:29 +05:30
free_stats :
f2fs_destroy_stats ( sbi ) ;
2012-11-02 17:07:47 +09:00
free_nm :
2021-11-04 16:22:01 +08:00
/* stop discard thread before destroying node manager */
f2fs_stop_discard_thread ( sbi ) ;
2018-05-30 00:20:41 +08:00
f2fs_destroy_node_manager ( sbi ) ;
2012-11-02 17:07:47 +09:00
free_sm :
2018-05-30 00:20:41 +08:00
f2fs_destroy_segment_manager ( sbi ) ;
f2fs: introduce checkpoint_merge mount option
We've added new mount options, "checkpoint_merge" and "nocheckpoint_merge",
which create a kernel daemon and make it merge concurrent checkpoint
requests as much as possible to eliminate redundant checkpoint issues. Plus,
we can eliminate the sluggish issue caused by slow checkpoint operation
when the checkpoint is done in a process context in a cgroup having
low i/o budget and cpu shares. To make this do better, we set the
default i/o priority of the kernel daemon to "3", to give it a higher
priority than other kernel threads. The verification result below
explains this.
The basic idea has come from https://opensource.samsung.com.
[Verification]
Android Pixel Device(ARM64, 7GB RAM, 256GB UFS)
Create two I/O cgroups (fg w/ weight 100, bg w/ weight 20)
Set "strict_guarantees" to "1" in BFQ tunables
In "fg" cgroup,
- thread A => trigger 1000 checkpoint operations
"for i in `seq 1 1000`; do touch test_dir1/file; fsync test_dir1;
done"
- thread B => generating async. I/O
"fio --rw=write --numjobs=1 --bs=128k --runtime=3600 --time_based=1
--filename=test_img --name=test"
In "bg" cgroup,
- thread C => trigger repeated checkpoint operations
"echo $$ > /dev/blkio/bg/tasks; while true; do touch test_dir2/file;
fsync test_dir2; done"
We've measured thread A's execution time.
[ w/o patch ]
Elapsed Time: Avg. 68 seconds
[ w/ patch ]
Elapsed Time: Avg. 48 seconds
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
[Jaegeuk Kim: fix the return value in f2fs_start_ckpt_thread, reported by Dan]
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-01-19 09:00:42 +09:00
stop_ckpt_thread :
f2fs_stop_ckpt_thread ( sbi ) ;
2022-10-28 17:30:26 +08:00
f2fs_destroy_post_read_wq ( sbi ) ;
2016-10-06 19:02:05 -07:00
free_devices :
destroy_device_list ( sbi ) ;
2018-12-13 18:38:33 -08:00
kvfree ( sbi - > ckpt ) ;
2012-11-02 17:07:47 +09:00
free_meta_inode :
make_bad_inode ( sbi - > meta_inode ) ;
iput ( sbi - > meta_inode ) ;
2019-01-01 00:11:30 -08:00
sbi - > meta_inode = NULL ;
2020-09-14 17:05:13 +08:00
free_page_array_cache :
f2fs_destroy_page_array_cache ( sbi ) ;
2020-02-25 18:17:10 +08:00
free_xattr_cache :
f2fs_destroy_xattr_caches ( sbi ) ;
2016-12-14 10:12:56 -08:00
free_io_dummy :
mempool_destroy ( sbi - > write_io_dummy ) ;
2018-01-17 16:31:35 +08:00
free_percpu :
destroy_percpu_info ( sbi ) ;
f2fs: introduce periodic iostat io latency traces
Whenever we notice sluggishness on our machines, we are always
curious about how well all types of I/O in the f2fs filesystem are
handled. But it's hard to get this kind of real data. First of all,
we need to reproduce the issue while running a profiling tool like
blktrace, but the issue doesn't happen again easily. Second, with the
intervention of any tool, the overall timing of the issue changes
slightly, which sometimes makes it hard for us to figure it out.
So, I added a feature that prints IO latency statistics as tracepoint
events, which are the minimum needed to understand the filesystem's I/O
related behaviors, under the F2FS_IOSTAT kernel config. With the
"iostat_enable" sysfs node on, we can get this statistics info
periodically with the least overhead.
[samples]
f2fs_ckpt-254:1-507 [003] .... 2842.439683: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [136/1/801], rd_node [136/1/1704], rd_meta [4/2/4],
wr_sync_data [164/16/3331], wr_sync_node [152/3/648],
wr_sync_meta [160/2/4243], wr_async_data [24/13/15],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
f2fs_ckpt-254:1-507 [002] .... 2845.450514: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [60/3/456], rd_node [60/3/1258], rd_meta [0/0/1],
wr_sync_data [120/12/2285], wr_sync_node [88/5/428],
wr_sync_meta [52/6/2990], wr_async_data [4/1/3],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-20 15:29:09 -07:00
free_iostat :
f2fs_destroy_iostat ( sbi ) ;
2018-01-17 16:31:35 +08:00
free_bio_info :
2017-05-10 11:18:25 -07:00
for ( i = 0 ; i < NR_PAGE_TYPE ; i + + )
2018-12-13 18:38:33 -08:00
kvfree ( sbi - > write_io [ i ] ) ;
2019-07-23 16:05:28 -07:00
2022-01-18 07:56:14 +01:00
# if IS_ENABLED(CONFIG_UNICODE)
2020-07-08 02:12:36 -07:00
utf8_unload ( sb - > s_encoding ) ;
2020-11-12 18:14:54 +09:00
sb - > s_encoding = NULL ;
2019-07-23 16:05:28 -07:00
# endif
2018-01-17 16:31:35 +08:00
free_options :
2017-08-08 10:54:31 +08:00
# ifdef CONFIG_QUOTA
for ( i = 0 ; i < MAXQUOTAS ; i + + )
2020-06-17 20:30:12 +08:00
kfree ( F2FS_OPTION ( sbi ) . s_qf_names [ i ] ) ;
2017-08-08 10:54:31 +08:00
# endif
fscrypt: handle test_dummy_encryption in more logical way
The behavior of the test_dummy_encryption mount option is that when a
new file (or directory or symlink) is created in an unencrypted
directory, it's automatically encrypted using a dummy encryption policy.
That's it; in particular, the encryption (or lack thereof) of existing
files (or directories or symlinks) doesn't change.
Unfortunately the implementation of test_dummy_encryption is a bit weird
and confusing. When test_dummy_encryption is enabled and a file is
being created in an unencrypted directory, we set up an encryption key
(->i_crypt_info) for the directory. This isn't actually used to do any
encryption, however, since the directory is still unencrypted! Instead,
->i_crypt_info is only used for inheriting the encryption policy.
One consequence of this is that the filesystem ends up providing a
"dummy context" (policy + nonce) instead of a "dummy policy". In
commit ed318a6cc0b6 ("fscrypt: support test_dummy_encryption=v2"), I
mistakenly thought this was required. However, actually the nonce only
ends up being used to derive a key that is never used.
Another consequence of this implementation is that it allows for
'inode->i_crypt_info != NULL && !IS_ENCRYPTED(inode)', which is an edge
case that can be forgotten about. For example, currently
FS_IOC_GET_ENCRYPTION_POLICY on an unencrypted directory may return the
dummy encryption policy when the filesystem is mounted with
test_dummy_encryption. That seems like the wrong thing to do, since
again, the directory itself is not actually encrypted.
Therefore, switch to a more logical and maintainable implementation
where the dummy encryption policy inheritance is done without setting up
keys for unencrypted directories. This involves:
- Adding a function fscrypt_policy_to_inherit() which returns the
encryption policy to inherit from a directory. This can be a real
policy, a dummy policy, or no policy.
- Replacing struct fscrypt_dummy_context, ->get_dummy_context(), etc.
with struct fscrypt_dummy_policy, ->get_dummy_policy(), etc.
- Making fscrypt_fname_encrypted_size() take an fscrypt_policy instead
of an inode.
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-13-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-16 21:11:35 -07:00
fscrypt_free_dummy_policy ( & F2FS_OPTION ( sbi ) . dummy_enc_policy ) ;
2018-12-13 18:38:33 -08:00
kvfree ( options ) ;
2012-11-02 17:07:47 +09:00
free_sb_buf :
2020-06-10 01:14:46 +03:00
kfree ( raw_super ) ;
2012-11-02 17:07:47 +09:00
free_sbi :
2016-03-02 12:04:24 -08:00
if ( sbi - > s_chksum_driver )
crypto_free_shash ( sbi - > s_chksum_driver ) ;
2020-06-10 01:14:46 +03:00
kfree ( sbi ) ;
2014-08-08 15:37:41 -07:00
/* give only one more chance */
2019-02-19 16:23:53 +08:00
if ( retry_cnt > 0 & & skip_recovery ) {
retry_cnt - - ;
2014-08-08 15:37:41 -07:00
shrink_dcache_sb ( sb ) ;
goto try_onemore ;
}
2012-11-02 17:07:47 +09:00
return err ;
}
static struct dentry * f2fs_mount ( struct file_system_type * fs_type , int flags ,
const char * dev_name , void * data )
{
return mount_bdev ( fs_type , flags , dev_name , data , f2fs_fill_super ) ;
}
2015-01-14 16:34:24 -08:00
static void kill_f2fs_super ( struct super_block * sb )
{
2017-06-29 23:17:45 +08:00
if ( sb - > s_root ) {
2018-07-06 16:47:34 -07:00
struct f2fs_sb_info * sbi = F2FS_SB ( sb ) ;
set_sbi_flag ( sbi , SBI_IS_CLOSE ) ;
f2fs_stop_gc_thread ( sbi ) ;
f2fs_stop_discard_thread ( sbi ) ;
2021-05-20 19:51:50 +08:00
# ifdef CONFIG_F2FS_FS_COMPRESSION
/*
* a later evict_inode ( ) can then bypass checking and invalidating
* the compress inode cache .
*/
if ( test_opt ( sbi , COMPRESS_CACHE ) )
truncate_inode_pages_final ( COMPRESS_MAPPING ( sbi ) ) ;
# endif
2018-07-06 16:47:34 -07:00
if ( is_sbi_flag_set ( sbi , SBI_IS_DIRTY ) | |
! is_set_ckpt_flags ( sbi , CP_UMOUNT_FLAG ) ) {
struct cp_control cpc = {
. reason = CP_UMOUNT ,
} ;
f2fs_write_checkpoint ( sbi , & cpc ) ;
}
f2fs: fix to flush all dirty inodes recovered in readonly fs
generic/417 reported as below:
------------[ cut here ]------------
kernel BUG at /home/yuchao/git/devf2fs/inode.c:695!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 21697 Comm: umount Tainted: G W O 4.18.0-rc2+ #39
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
EIP: f2fs_evict_inode+0x556/0x580 [f2fs]
Call Trace:
? _raw_spin_unlock+0x2c/0x50
evict+0xa8/0x170
dispose_list+0x34/0x40
evict_inodes+0x118/0x120
generic_shutdown_super+0x41/0x100
? rcu_read_lock_sched_held+0x97/0xa0
kill_block_super+0x22/0x50
kill_f2fs_super+0x6f/0x80 [f2fs]
deactivate_locked_super+0x3d/0x70
deactivate_super+0x40/0x60
cleanup_mnt+0x39/0x70
__cleanup_mnt+0x10/0x20
task_work_run+0x81/0xa0
exit_to_usermode_loop+0x59/0xa7
do_fast_syscall_32+0x1f5/0x22c
entry_SYSENTER_32+0x53/0x86
EIP: f2fs_evict_inode+0x556/0x580 [f2fs]
It can simply be reproduced with scripts:
Enable quota feature during mkfs.
Testcase1:
1. mkfs.f2fs /dev/zram0
2. mount -t f2fs /dev/zram0 /mnt/f2fs
3. xfs_io -f /mnt/f2fs/file -c "pwrite 0 4k" -c "fsync"
4. godown /mnt/f2fs
5. umount /mnt/f2fs
6. mount -t f2fs -o ro /dev/zram0 /mnt/f2fs
7. umount /mnt/f2fs
Testcase2:
1. mkfs.f2fs /dev/zram0
2. mount -t f2fs /dev/zram0 /mnt/f2fs
3. touch /mnt/f2fs/file
4. create process[pid = x] do:
a) open /mnt/f2fs/file;
b) unlink /mnt/f2fs/file
5. godown -f /mnt/f2fs
6. kill process[pid = x]
7. umount /mnt/f2fs
8. mount -t f2fs -o ro /dev/zram0 /mnt/f2fs
9. umount /mnt/f2fs
The reason is: during recovery, i_{c,m}time of the inode is updated, so the
inode can be set dirty without being tracked in the sbi->inode_list[DIRTY_META]
global list, and a later write_checkpoint will not flush such a dirty inode
into its node page.
Once umount is called, sync_filesystem() in generic_shutdown_super() will
skip syncing dirty inodes due to the sb_rdonly check, leaving dirty inodes
there.
To solve this issue, during umount, remove the SB_RDONLY flag from
sb->s_flags, to make sure sync_filesystem() will not be skipped.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
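The fix described in the commit message above can be modelled in userspace as a minimal sketch. Everything named `demo_*` or `DEMO_*` is a hypothetical stand-in invented for illustration (not a kernel type or function); the sketch only shows why clearing the read-only flag before the sync step lets recovery-dirtied inodes be flushed.

```c
#include <stdbool.h>

/* Hypothetical stand-in for SB_RDONLY; not the kernel definition. */
#define DEMO_SB_RDONLY 0x1u

/* Hypothetical, simplified model of the superblock state involved. */
struct demo_sb {
	unsigned int s_flags;	/* mount flags, modelled on sb->s_flags */
	bool recovered;		/* models the SBI_IS_RECOVERED flag */
	int dirty_inodes;	/* inodes dirtied during recovery */
};

/* Models the sync step: it is skipped entirely on a read-only superblock. */
static int demo_sync(struct demo_sb *sb)
{
	int flushed;

	if (sb->s_flags & DEMO_SB_RDONLY)
		return 0;
	flushed = sb->dirty_inodes;
	sb->dirty_inodes = 0;
	return flushed;
}

/* Models the unmount path: drop the read-only flag first if recovery ran. */
static int demo_umount(struct demo_sb *sb)
{
	if (sb->recovered && (sb->s_flags & DEMO_SB_RDONLY))
		sb->s_flags &= ~DEMO_SB_RDONLY;
	return demo_sync(sb);
}
```

In this model, a read-only mount that ran recovery flushes its dirty inodes at unmount, while a read-only mount without recovery keeps its flag and skips the sync as before.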
2018-08-22 17:11:05 +08:00
if ( is_sbi_flag_set ( sbi , SBI_IS_RECOVERED ) && f2fs_readonly ( sb ) )
sb -> s_flags &= ~ SB_RDONLY ;
2017-06-29 23:17:45 +08:00
}
2015-01-14 16:34:24 -08:00
kill_block_super ( sb ) ;
}
2012-11-02 17:07:47 +09:00
static struct file_system_type f2fs_fs_type = {
. owner = THIS_MODULE ,
. name = "f2fs" ,
. mount = f2fs_mount ,
2015-01-14 16:34:24 -08:00
. kill_sb = kill_f2fs_super ,
2022-02-04 13:24:56 +08:00
. fs_flags = FS_REQUIRES_DEV | FS_ALLOW_IDMAP ,
2012-11-02 17:07:47 +09:00
} ;
2013-03-02 19:39:14 -08:00
MODULE_ALIAS_FS ( "f2fs" ) ;
2012-11-02 17:07:47 +09:00
2013-01-17 00:08:30 +09:00
static int __init init_inodecache ( void )
2012-11-02 17:07:47 +09:00
{
2016-01-14 15:18:21 -08:00
f2fs_inode_cachep = kmem_cache_create ( "f2fs_inode_cache" ,
sizeof ( struct f2fs_inode_info ) , 0 ,
SLAB_RECLAIM_ACCOUNT | SLAB_ACCOUNT , NULL ) ;
2022-11-25 19:47:36 +08:00
return f2fs_inode_cachep ? 0 : -ENOMEM ;
2012-11-02 17:07:47 +09:00
}
static void destroy_inodecache ( void )
{
/*
* Make sure all delayed rcu free inodes are flushed before we
* destroy cache .
*/
rcu_barrier ( ) ;
kmem_cache_destroy ( f2fs_inode_cachep ) ;
}
static int __init init_f2fs_fs ( void )
{
int err ;
disable loading f2fs module on PAGE_SIZE > 4KB
The following patch disables loading of the f2fs module on architectures
which have PAGE_SIZE > 4096, since it is impossible to mount f2fs on
such architectures. Log messages are:
mount: /mnt: wrong fs type, bad option, bad superblock on
/dev/vdiskb1, missing codepage or helper program, or other error.
/dev/vdiskb1: F2FS filesystem,
UUID=1d8b9ca4-2389-4910-af3b-10998969f09c, volume name ""
May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Invalid
page_cache_size (8192), supports only 4KB
May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Can't find valid F2FS
filesystem in 1th superblock
May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Invalid
page_cache_size (8192), supports only 4KB
May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Can't find valid F2FS
filesystem in 2th superblock
May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Invalid
page_cache_size (8192), supports only 4KB
which was introduced by git commit 5c9b469295fb6b10d98923eab5e79c4edb80ed20
tested on git kernel 4.17.0-rc6-00309-gec30dcf7f425
with patch applied:
modprobe: ERROR: could not insert 'f2fs': Invalid argument
May 28 01:40:28 v215 kernel: F2FS not supported on PAGE_SIZE(8192) != 4096
Signed-off-by: Anatoly Pugachev <matorola@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-28 02:06:37 +03:00
if ( PAGE_SIZE != F2FS_BLKSIZE ) {
printk ( "F2FS not supported on PAGE_SIZE(%lu) != %d\n" ,
PAGE_SIZE , F2FS_BLKSIZE ) ;
return -EINVAL ;
}
2012-11-02 17:07:47 +09:00
err = init_inodecache ( ) ;
if ( err )
goto fail ;
f2fs: clean up symbol namespace
As Ted reported:
"Hi, I was looking at f2fs's sources recently, and I noticed that there
is a very large number of non-static symbols which don't have a f2fs
prefix. There's well over a hundred (see attached below).
As one example, in fs/f2fs/dir.c there is:
unsigned char get_de_type(struct f2fs_dir_entry *de)
This function is clearly only useful for f2fs, but it has a generic
name. This means that if any other file system tries to have the same
symbol name, there will be a symbol conflict and the kernel would not
successfully build. It also means that when someone is looking f2fs
sources, it's not at all obvious whether a function such as
read_data_page(), invalidate_blocks(), is a generic kernel function
found in the fs, mm, or block layers, or a f2fs specific function.
You might want to fix this at some point. Hopefully Kent's bcachefs
isn't similarly using genericly named functions, since that might
cause conflicts with f2fs's functions --- but just as this would be a
problem that we would rightly insist that Kent fix, this is something
that we should have rightly insisted that f2fs should have fixed
before it was integrated into the mainline kernel.
acquire_orphan_inode
add_ino_entry
add_orphan_inode
allocate_data_block
allocate_new_segments
alloc_nid
alloc_nid_done
alloc_nid_failed
available_free_memory
...."
This patch adds "f2fs_" prefix for all non-static symbols in order to:
a) avoid conflict with other kernel generic symbols;
b) to indicate the function is f2fs specific one instead of generic
one;
Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 00:20:41 +08:00
err = f2fs_create_node_manager_caches ( ) ;
2012-11-02 17:07:47 +09:00
if ( err )
2013-08-20 16:49:51 +08:00
goto free_inodecache ;
2018-05-30 00:20:41 +08:00
err = f2fs_create_segment_manager_caches ( ) ;
2012-11-02 17:07:47 +09:00
if ( err )
2013-08-20 16:49:51 +08:00
goto free_node_manager_caches ;
2018-05-30 00:20:41 +08:00
err = f2fs_create_checkpoint_caches ( ) ;
2012-11-02 17:07:47 +09:00
if ( err )
2014-12-29 15:56:18 +08:00
goto free_segment_manager_caches ;
2021-05-07 18:10:38 +08:00
err = f2fs_create_recovery_cache ( ) ;
f2fs: enable rb-tree extent cache
This patch enables rb-tree based extent cache in f2fs.
When we mount with "-o extent_cache", f2fs will try to add recently accessed
page-block mappings into the rb-tree based extent cache as much as possible,
instead of the original single extent info cache.
In this way, f2fs can support a more effective cache between the dnode page
cache and disk. It provides a high hit ratio in the cache with less memory
when the dnode page cache is reclaimed in a low-memory environment.
Storage: Sandisk sd card 64g
1.append write file (offset: 0, size: 128M);
2.override write file (offset: 2M, size: 1M);
3.override write file (offset: 4M, size: 1M);
...
4.override write file (offset: 48M, size: 1M);
...
5.override write file (offset: 112M, size: 1M);
6.sync
7.echo 3 > /proc/sys/vm/drop_caches
8.read file (size:128M, unit: 4k, count: 32768)
(time dd if=/mnt/f2fs/128m bs=4k count=32768)
Extent Hit Ratio:
                before          patched
Hit Ratio       121 / 1071      1071 / 1071
Performance:
                before          patched
real            0m37.051s       0m35.556s
user            0m0.040s        0m0.026s
sys             0m2.990s        0m2.251s
Memory Cost:
                before          patched
Tree Count:     0               1 (size: 24 bytes)
Node Count:     0               45 (size: 1440 bytes)
v3:
o retest and give more details of the test results.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-05 17:57:31 +08:00
if ( err )
goto free_checkpoint_caches ;
2021-05-07 18:10:38 +08:00
err = f2fs_create_extent_cache ( ) ;
if ( err )
goto free_recovery_cache ;
f2fs: support age threshold based garbage collection
There are several issues in the current background GC algorithm:
- valid blocks is one of the key factors in the cost overhead calculation,
so if a segment has fewer valid blocks, then even if its age is young or
it is located in a hot segment, the CB algorithm will still choose the
segment as victim, which is not appropriate.
- GCed data/node will go to existing logs, no matter whether the update
frequency of the data already there is the same or not, so it may mix hot
and cold data again.
- The GC allocator mainly uses LFS type segments, so it consumes free
segments more quickly.
This patch introduces a new algorithm named age threshold based
garbage collection to solve the above issues. There are three main
steps:
1. select a source victim:
- set an age threshold, and select candidates based on the threshold:
e.g.
0 means youngest, 100 means oldest, if we set age threshold to 80
then select dirty segments which have age in range of [80, 100] as
candidates;
- set candidate_ratio threshold, and select candidates based on the
ratio, so that we can shrink candidates to those oldest segments;
- select target segment with fewest valid blocks in order to
migrate blocks with minimum cost;
2. select a target victim:
- select candidates based on age threshold;
- set candidate_radius threshold, search candidates whose age is
around source victims, searching radius should be less than the
radius threshold.
- select target segment with most valid blocks in order to avoid
migrating current target segment.
3. merge valid blocks from source victim into target victim with
SSR allocator.
Test steps:
- create 160 dirty segments:
* half of them have 128 valid blocks per segment
* left of them have 384 valid blocks per segment
- run background GC
Benefit: GC count and block movement count both decrease obviously:
- Before:
- Valid: 86
- Dirty: 1
- Prefree: 11
- Free: 6001 (6001)
GC calls: 162 (BG: 220)
- data segments : 160 (160)
- node segments : 2 (2)
Try to move 41454 blocks (BG: 41454)
- data blocks : 40960 (40960)
- node blocks : 494 (494)
IPU: 0 blocks
SSR: 0 blocks in 0 segments
LFS: 41364 blocks in 81 segments
- After:
- Valid: 87
- Dirty: 0
- Prefree: 4
- Free: 6008 (6008)
GC calls: 75 (BG: 76)
- data segments : 74 (74)
- node segments : 1 (1)
Try to move 12813 blocks (BG: 12813)
- data blocks : 12544 (12544)
- node blocks : 269 (269)
IPU: 0 blocks
SSR: 12032 blocks in 77 segments
LFS: 855 blocks in 2 segments
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix a bug along with pinfile in-mem segment & clean up]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
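Step 1 of the algorithm described in the commit message above (source victim selection) can be sketched as follows. This is a simplified illustration under stated assumptions, not the f2fs implementation: `struct demo_segment` and `demo_pick_source_victim()` are hypothetical names, and the candidate_ratio shrinking step is deliberately left out.

```c
#include <stddef.h>

/* Hypothetical segment descriptor; the field names are illustrative only. */
struct demo_segment {
	unsigned int age;		/* 0 = youngest, 100 = oldest */
	unsigned int valid_blocks;	/* blocks that would have to be migrated */
};

/*
 * Return the index of the segment with age >= age_threshold that has the
 * fewest valid blocks, or -1 if no segment qualifies. On a tie, the
 * earliest qualifying segment wins.
 */
static int demo_pick_source_victim(const struct demo_segment *segs, size_t n,
				   unsigned int age_threshold)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (segs[i].age < age_threshold)
			continue;	/* too young to be a candidate */
		if (best < 0 || segs[i].valid_blocks < segs[best].valid_blocks)
			best = (int)i;
	}
	return best;
}
```

With an age threshold of 80, a young segment is never picked even if it holds very few valid blocks, which is the behaviour the commit message contrasts with the cost-benefit (CB) algorithm.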
2020-08-04 21:14:49 +08:00
err = f2fs_create_garbage_collection_cache ( ) ;
2017-06-14 17:39:46 +08:00
if ( err )
2015-02-05 17:57:31 +08:00
goto free_extent_cache ;
2020-08-04 21:14:49 +08:00
err = f2fs_init_sysfs ( ) ;
if ( err )
goto free_garbage_collection_cache ;
2022-05-31 20:22:24 -07:00
err = register_shrinker ( & f2fs_shrinker_info , "f2fs-shrinker" ) ;
2015-05-15 15:37:24 -07:00
if ( err )
2017-06-14 17:39:46 +08:00
goto free_sysfs ;
2015-06-19 12:01:21 -07:00
err = register_filesystem ( & f2fs_fs_type ) ;
if ( err )
goto free_shrinker ;
2019-01-04 14:26:18 +01:00
f2fs_create_root_stats ( ) ;
f2fs: refactor read path to allow multiple postprocessing steps
Currently f2fs's ->readpage() and ->readpages() assume that either the
data undergoes no postprocessing, or decryption only. But with
fs-verity, there will be an additional authenticity verification step,
and it may be needed either by itself, or combined with decryption.
To support this, store a 'struct bio_post_read_ctx' in ->bi_private
which contains a work struct, a bitmask of postprocessing steps that are
enabled, and an indicator of the current step. The bio completion
routine, if there was no I/O error, enqueues the first postprocessing
step. When that completes, it continues to the next step. Pages that
fail any postprocessing step have PageError set. Once all steps have
completed, pages without PageError set are set Uptodate, and all pages
are unlocked.
Also replace f2fs_encrypted_file() with a new function
f2fs_post_read_required() in places like direct I/O and garbage
collection that really should be testing whether the file needs special
I/O processing, not whether it is encrypted specifically.
This may also be useful for other future f2fs features such as
compression.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
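The "bitmask of enabled steps plus current step indicator" design described in the commit message above can be illustrated with a small sketch. The names below (`demo_step`, `demo_post_read_ctx`, `demo_next_step`) are hypothetical and chosen for illustration; the real `struct bio_post_read_ctx` also carries a work struct and may differ in detail.

```c
/* Hypothetical step identifiers; the real names in f2fs may differ. */
enum demo_step {
	DEMO_STEP_DECRYPT = 0,
	DEMO_STEP_VERITY = 1,
	DEMO_STEP_COUNT = 2,
};

/* Hypothetical, reduced version of the per-bio post-read context. */
struct demo_post_read_ctx {
	unsigned int enabled_steps;	/* bitmask of steps to run */
	int cur_step;			/* next step to consider */
};

/*
 * Advance to the next enabled step and return it, or return -1 once
 * every step has been considered. Disabled steps are skipped.
 */
static int demo_next_step(struct demo_post_read_ctx *ctx)
{
	while (ctx->cur_step < DEMO_STEP_COUNT) {
		int step = ctx->cur_step++;

		if (ctx->enabled_steps & (1u << step))
			return step;
	}
	return -1;
}
```

A caller would invoke `demo_next_step()` after each step completes; a return of -1 corresponds to the point where the pages can be marked up to date and unlocked.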
2018-04-18 11:09:48 -07:00
err = f2fs_init_post_read_processing ( ) ;
if ( err )
goto free_root_stats ;
f2fs: introduce periodic iostat io latency traces
Whenever we notice some sluggish issues on our machines, we are always
curious about how well all types of I/O in the f2fs filesystem are
handled. But it's hard to get this kind of real data. First of all,
we need to reproduce the issue while turning on a profiling tool like
blktrace, but the issue doesn't happen again easily. Second, with the
intervention of any tool, the overall timing of the issue will be
slightly changed, which sometimes makes it hard to figure out.
So, I added a feature that prints out IO latency statistics tracepoint
events, which are the minimum needed to understand the filesystem's I/O
related behaviors, under the F2FS_IOSTAT kernel config. With the
"iostat_enable" sysfs node on, we can get this statistics info
periodically, with the least overhead.
[samples]
f2fs_ckpt-254:1-507 [003] .... 2842.439683: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [136/1/801], rd_node [136/1/1704], rd_meta [4/2/4],
wr_sync_data [164/16/3331], wr_sync_node [152/3/648],
wr_sync_meta [160/2/4243], wr_async_data [24/13/15],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
f2fs_ckpt-254:1-507 [002] .... 2845.450514: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [60/3/456], rd_node [60/3/1258], rd_meta [0/0/1],
wr_sync_data [120/12/2285], wr_sync_node [88/5/428],
wr_sync_meta [52/6/2990], wr_async_data [4/1/3],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-20 15:29:09 -07:00
err = f2fs_init_iostat_processing ( ) ;
2019-09-30 18:53:25 +08:00
if ( err )
goto free_post_read ;
2021-08-20 15:29:09 -07:00
err = f2fs_init_bio_entry_cache ( ) ;
if ( err )
goto free_iostat ;
2019-12-04 09:52:58 +08:00
err = f2fs_init_bioset ( ) ;
if ( err )
2022-12-07 13:42:17 +00:00
goto free_bio_entry_cache ;
f2fs: introduce mempool for {,de}compress intermediate page allocation
If the compression feature is on and there is not enough free memory,
the page refault ratio is higher than before. The root cause is:
- the {,de}compression flow needs to allocate intermediate pages to store
compressed data in a cluster, so during their allocation, the vm may reclaim
mmaped pages.
- if the reclaimed pages above belong to a compressed cluster, their
refault may cause more intermediate page allocations, which results in
reclaiming more mmaped pages.
So this patch introduces a mempool for intermediate page allocation
in order to avoid a high refault ratio. By default, the number of
preallocated pages in the pool is 512; the user can change the number by
assigning the 'num_compress_pages' parameter during module initialization.
Ma Feng found warnings in the original patch and fixed like below.
Fix the following sparse warning:
fs/f2fs/compress.c:501:5: warning: symbol 'num_compress_pages' was not declared.
Should it be static?
fs/f2fs/compress.c:530:6: warning: symbol 'f2fs_compress_free_page' was not
declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Ma Feng <mafeng.ma@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-04-08 19:56:05 +08:00
err = f2fs_init_compress_mempool ( ) ;
if ( err )
goto free_bioset ;
2020-09-14 17:05:14 +08:00
err = f2fs_init_compress_cache ( ) ;
if ( err )
goto free_compress_mempool ;
2021-06-11 07:46:30 +08:00
err = f2fs_create_casefold_cache ( ) ;
if ( err )
goto free_compress_cache ;
2013-08-20 16:49:51 +08:00
return 0 ;
2021-06-11 07:46:30 +08:00
free_compress_cache :
f2fs_destroy_compress_cache ( ) ;
2020-09-14 17:05:14 +08:00
free_compress_mempool :
f2fs_destroy_compress_mempool ( ) ;
2020-04-08 19:56:05 +08:00
free_bioset :
f2fs_destroy_bioset ( ) ;
2022-12-07 13:42:17 +00:00
free_bio_entry_cache :
2019-12-04 09:52:58 +08:00
f2fs_destroy_bio_entry_cache ( ) ;
2021-08-20 15:29:09 -07:00
free_iostat :
f2fs_destroy_iostat_processing ( ) ;
2019-09-30 18:53:25 +08:00
free_post_read :
f2fs_destroy_post_read_processing ( ) ;
2018-04-18 11:09:48 -07:00
free_root_stats :
f2fs_destroy_root_stats ( ) ;
2015-10-29 09:13:04 +08:00
unregister_filesystem ( & f2fs_fs_type ) ;
2015-06-19 12:01:21 -07:00
free_shrinker :
unregister_shrinker ( & f2fs_shrinker_info ) ;
2017-06-14 17:39:46 +08:00
free_sysfs :
2017-07-26 11:24:13 -07:00
f2fs_exit_sysfs ( ) ;
2020-08-04 21:14:49 +08:00
free_garbage_collection_cache :
f2fs_destroy_garbage_collection_cache ( ) ;
f2fs: enable rb-tree extent cache
This patch enables the rb-tree based extent cache in f2fs.
When mounted with "-o extent_cache", f2fs will try to add as many recently
accessed page-block mappings as possible into the rb-tree based extent cache,
instead of the original single extent info cache.
This way, f2fs can provide a more effective cache between the dnode page cache
and disk. It offers a high hit ratio with less memory when the dnode page cache
is reclaimed in a low-memory environment.
Storage: Sandisk sd card 64g
1.append write file (offset: 0, size: 128M);
2.override write file (offset: 2M, size: 1M);
3.override write file (offset: 4M, size: 1M);
...
4.override write file (offset: 48M, size: 1M);
...
5.override write file (offset: 112M, size: 1M);
6.sync
7.echo 3 > /proc/sys/vm/drop_caches
8.read file (size:128M, unit: 4k, count: 32768)
(time dd if=/mnt/f2fs/128m bs=4k count=32768)
Extent Hit Ratio:
            before        patched
Hit Ratio   121 / 1071    1071 / 1071
Performance:
            before        patched
real        0m37.051s     0m35.556s
user        0m0.040s      0m0.026s
sys         0m2.990s      0m2.251s
Memory Cost:
            before        patched
Tree Count: 0             1 (size: 24 bytes)
Node Count: 0             45 (size: 1440 bytes)
v3:
o retest and give more details of the test results.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-05 17:57:31 +08:00
free_extent_cache :
f2fs: clean up symbol namespace
As Ted reported:
"Hi, I was looking at f2fs's sources recently, and I noticed that there
is a very large number of non-static symbols which don't have a f2fs
prefix. There's well over a hundred (see attached below).
As one example, in fs/f2fs/dir.c there is:
unsigned char get_de_type(struct f2fs_dir_entry *de)
This function is clearly only useful for f2fs, but it has a generic
name. This means that if any other file system tries to have the same
symbol name, there will be a symbol conflict and the kernel would not
successfully build. It also means that when someone is looking f2fs
sources, it's not at all obvious whether a function such as
read_data_page(), invalidate_blocks(), is a generic kernel function
found in the fs, mm, or block layers, or a f2fs specific function.
You might want to fix this at some point. Hopefully Kent's bcachefs
isn't similarly using genericly named functions, since that might
cause conflicts with f2fs's functions --- but just as this would be a
problem that we would rightly insist that Kent fix, this is something
that we should have rightly insisted that f2fs should have fixed
before it was integrated into the mainline kernel.
acquire_orphan_inode
add_ino_entry
add_orphan_inode
allocate_data_block
allocate_new_segments
alloc_nid
alloc_nid_done
alloc_nid_failed
available_free_memory
...."
This patch adds an "f2fs_" prefix to all non-static symbols in order to:
a) avoid conflicts with other generic kernel symbols;
b) indicate that a function is f2fs-specific rather than generic.
Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 00:20:41 +08:00
f2fs_destroy_extent_cache ( ) ;
2021-05-07 18:10:38 +08:00
free_recovery_cache :
f2fs_destroy_recovery_cache ( ) ;
2013-08-20 16:49:51 +08:00
free_checkpoint_caches :
2018-05-30 00:20:41 +08:00
f2fs_destroy_checkpoint_caches ( ) ;
2013-11-15 13:55:58 +09:00
free_segment_manager_caches :
2018-05-30 00:20:41 +08:00
f2fs_destroy_segment_manager_caches ( ) ;
2013-08-20 16:49:51 +08:00
free_node_manager_caches :
2018-05-30 00:20:41 +08:00
f2fs_destroy_node_manager_caches ( ) ;
2013-08-20 16:49:51 +08:00
free_inodecache :
destroy_inodecache ( ) ;
2012-11-02 17:07:47 +09:00
fail :
return err ;
}
static void __exit exit_f2fs_fs ( void )
{
2021-06-11 07:46:30 +08:00
f2fs_destroy_casefold_cache ( ) ;
2020-09-14 17:05:14 +08:00
f2fs_destroy_compress_cache ( ) ;
f2fs: introduce mempool for {,de}compress intermediate page allocation
If the compression feature is on and free memory is insufficient, the
page refault ratio is higher than before. The root cause is:
- the {,de}compression flow needs to allocate intermediate pages to store
compressed data in a cluster, so during their allocation, the vm may
reclaim mmapped pages.
- if the reclaimed pages belong to a compressed cluster, refaulting them
may cause more intermediate page allocations, which results in
reclaiming more mmapped pages.
So this patch introduces a mempool for intermediate page allocation
in order to avoid a high refault ratio. By default, the number of
preallocated pages in the pool is 512; the user can change the number by
assigning the 'num_compress_pages' parameter during module initialization.
Ma Feng found warnings in the original patch and fixed them as below.
Fix the following sparse warning:
fs/f2fs/compress.c:501:5: warning: symbol 'num_compress_pages' was not declared.
Should it be static?
fs/f2fs/compress.c:530:6: warning: symbol 'f2fs_compress_free_page' was not
declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Ma Feng <mafeng.ma@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-04-08 19:56:05 +08:00
f2fs_destroy_compress_mempool ( ) ;
2019-12-04 09:52:58 +08:00
f2fs_destroy_bioset ( ) ;
2019-09-30 18:53:25 +08:00
f2fs_destroy_bio_entry_cache ( ) ;
f2fs: introduce periodic iostat io latency traces
Whenever we notice sluggishness on our machines, we are always
curious about how well all types of I/O in the f2fs filesystem are
handled. But it's hard to get this kind of real data. First of all,
we need to reproduce the issue while running a profiling tool like
blktrace, but the issue doesn't reproduce easily. Second, with the
intervention of any tool, the overall timing of the issue changes
slightly, which sometimes makes it hard to figure out.
So, I added a feature that prints IO latency statistics as tracepoint
events, the minimum needed to understand the filesystem's I/O related
behavior, under the F2FS_IOSTAT kernel config. With the "iostat_enable"
sysfs node on, we can get this statistics info periodically with
minimal overhead.
[samples]
f2fs_ckpt-254:1-507 [003] .... 2842.439683: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [136/1/801], rd_node [136/1/1704], rd_meta [4/2/4],
wr_sync_data [164/16/3331], wr_sync_node [152/3/648],
wr_sync_meta [160/2/4243], wr_async_data [24/13/15],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
f2fs_ckpt-254:1-507 [002] .... 2845.450514: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [60/3/456], rd_node [60/3/1258], rd_meta [0/0/1],
wr_sync_data [120/12/2285], wr_sync_node [88/5/428],
wr_sync_meta [52/6/2990], wr_async_data [4/1/3],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-20 15:29:09 -07:00
f2fs_destroy_iostat_processing ( ) ;
f2fs: refactor read path to allow multiple postprocessing steps
Currently f2fs's ->readpage() and ->readpages() assume that either the
data undergoes no postprocessing, or decryption only. But with
fs-verity, there will be an additional authenticity verification step,
and it may be needed either by itself, or combined with decryption.
To support this, store a 'struct bio_post_read_ctx' in ->bi_private
which contains a work struct, a bitmask of postprocessing steps that are
enabled, and an indicator of the current step. The bio completion
routine, if there was no I/O error, enqueues the first postprocessing
step. When that completes, it continues to the next step. Pages that
fail any postprocessing step have PageError set. Once all steps have
completed, pages without PageError set are set Uptodate, and all pages
are unlocked.
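The step sequencing described above can be sketched as a small userspace model. This is a simplification under stated assumptions, not the kernel implementation: the steps run synchronously instead of being enqueued as work items, a single page stands in for the whole bio, processing stops at the first failing step, and every name here (`post_read_ctx`, `run_post_read()`, `step_ok()`, `step_fail()`) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical step identifiers; the bit position is the step's order. */
enum post_read_step {
	STEP_DECRYPT = 0,
	STEP_VERITY,
	STEP_COUNT
};

/* Simplified stand-in for the per-bio context (not the kernel struct). */
struct post_read_ctx {
	unsigned int enabled_steps; /* bitmask of enabled steps */
	int cur_step;               /* indicator of the current step */
	int page_error;             /* stands in for PageError */
	int uptodate;               /* stands in for Uptodate */
};

typedef int (*step_fn)(void);   /* returns 0 on success */

/* Demo handlers used only for illustration. */
int step_ok(void)   { return 0; }
int step_fail(void) { return -1; }

/* Run each enabled step in order; mark the page uptodate only if none failed. */
void run_post_read(struct post_read_ctx *ctx, const step_fn handlers[STEP_COUNT])
{
	assert(ctx != NULL && handlers != NULL);

	for (ctx->cur_step = 0; ctx->cur_step < STEP_COUNT; ctx->cur_step++) {
		if (!(ctx->enabled_steps & (1u << ctx->cur_step)))
			continue; /* this step is not enabled for the read */
		if (handlers[ctx->cur_step]() != 0) {
			ctx->page_error = 1; /* failed step: flag the page */
			break;
		}
	}
	ctx->uptodate = !ctx->page_error;
}
```

Keeping the enabled steps in a bitmask lets one read path serve decryption only, verification only, or both, which is the flexibility the message above is after.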
Also replace f2fs_encrypted_file() with a new function
f2fs_post_read_required() in places like direct I/O and garbage
collection that really should be testing whether the file needs special
I/O processing, not whether it is encrypted specifically.
This may also be useful for other future f2fs features such as
compression.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-18 11:09:48 -07:00
f2fs_destroy_post_read_processing ( ) ;
2013-01-15 19:58:47 +09:00
f2fs_destroy_root_stats ( ) ;
2012-11-02 17:07:47 +09:00
unregister_filesystem ( & f2fs_fs_type ) ;
2016-05-18 08:02:25 +08:00
unregister_shrinker ( & f2fs_shrinker_info ) ;
2017-07-26 11:24:13 -07:00
f2fs_exit_sysfs ( ) ;
2020-08-04 21:14:49 +08:00
f2fs_destroy_garbage_collection_cache ( ) ;
2018-05-30 00:20:41 +08:00
f2fs_destroy_extent_cache ( ) ;
2021-05-07 18:10:38 +08:00
f2fs_destroy_recovery_cache ( ) ;
2018-05-30 00:20:41 +08:00
f2fs_destroy_checkpoint_caches ( ) ;
f2fs_destroy_segment_manager_caches ( ) ;
f2fs_destroy_node_manager_caches ( ) ;
2012-11-02 17:07:47 +09:00
destroy_inodecache ( ) ;
}
module_init ( init_f2fs_fs )
module_exit ( exit_f2fs_fs )
MODULE_AUTHOR ( " Samsung Electronics's Praesto Team " ) ;
MODULE_DESCRIPTION ( " Flash Friendly File System " ) ;
MODULE_LICENSE ( " GPL " ) ;
2021-05-18 09:57:54 +08:00
MODULE_SOFTDEP ( " pre: crc32 " ) ;
2016-11-04 14:59:15 -07:00